oMLX

~/runtime/omlx

oMLX

MLX inference server for Apple Silicon. Paged SSD KV cache, menu-bar app, OpenAI + Anthropic API.

License

Apache 2.0

Platform

macOS Sequoia · M1 / M2 / M3 / M4 / M5

Model formats

MLX

API

OpenAI /v1/chat/completions · Anthropic /v1/messages · Embeddings · Rerank

What it is.

$ ./vrambudget --runtime omlx

Native macOS inference server built on MLX. Solves the biggest pain in agentic local inference: when the KV cache invalidates mid-session (which it does constantly with coding agents), oMLX restores cached prefix blocks from SSD in milliseconds instead of recomputing from scratch. TTFT from 30-90s down to under 5s on the second turn. Drop-in for Claude Code, OpenClaw, Cursor. Native menu-bar app, not Electron. Apache 2.0.

Install.

$ pkg install omlx

brew tap jundot/omlx https://github.com/jundot/omlx && brew install omlx

Or download the signed .dmg from omlx.ai (menu-bar app)

Supported platforms: macOS 15+ (Apple Silicon)

Features.

$ cat features.md

Paged SSD KV cache

Two-tier cache: hot blocks in RAM, cold blocks on SSD in safetensors. Prefixes survive cache eviction AND server restarts. Recomputed from scratch only when truly novel.

Continuous batching

mlx-lm BatchGenerator under the hood. M3 Ultra 512GB hits 190 tok/s at 8x concurrency (3.36x speedup over single-request).

OpenAI + Anthropic API

Both surfaces native. Web dashboard generates the exact config command for Claude Code, OpenClaw, Cursor, Codex, Hermes, Copilot.

Multi-model serving

LLMs, VLMs, embeddings, rerankers loaded simultaneously. LRU eviction + per-model TTL + manual pin/unload from the admin panel.

Tool calling + MCP

JSON / Qwen / Gemma / GLM / MiniMax tool formats auto-detected. Native MCP integration. Tool-result trimming for oversized outputs.

Native menu-bar app

PyObjC menu-bar app (not Electron). Signed, notarized, in-app auto-update. Persistent serving stats across restarts.

Best for

▸Agentic coding on a Mac: Claude Code, OpenClaw, Cursor with sub-5s TTFT after the first turn
▸Anyone with a Mac running long agent sessions where context shifts often
▸Multi-model serving (LLM + VLM + embedding + reranker simultaneously)
▸Drop-in replacement for cloud APIs while staying private

Caveats

▸macOS Sequoia (15+) and Apple Silicon only; no Linux or Intel Mac path
▸MLX-format models only; bring your own from huggingface.co/mlx-community
▸Reuses LM Studio model directories if you already have them, but does NOT auto-import from llama.cpp / GGUF

What it is.

Install.

Features.

Links.

Compare to…

Discussion.