~/runtime/omlx
oMLX

oMLX

MLX inference server for Apple Silicon. Paged SSD KV cache, menu-bar app, OpenAI + Anthropic API.

License
Apache 2.0
Platform
macOS Sequoia · M1 / M2 / M3 / M4 / M5
Model formats
MLX
API
OpenAI /v1/chat/completions · Anthropic /v1/messages · Embeddings · Rerank

What it is.

$ ./vrambudget --runtime omlx

Native macOS inference server built on MLX. Solves the biggest pain in agentic local inference: when the KV cache invalidates mid-session (which it does constantly with coding agents), oMLX restores cached prefix blocks from SSD in milliseconds instead of recomputing from scratch. TTFT from 30-90s down to under 5s on the second turn. Drop-in for Claude Code, OpenClaw, Cursor. Native menu-bar app, not Electron. Apache 2.0.

Install.

$ pkg install omlx
brew tap jundot/omlx https://github.com/jundot/omlx && brew install omlx
Or download the signed .dmg from omlx.ai (menu-bar app)

Supported platforms: macOS 15+ (Apple Silicon)

Features.

$ cat features.md
Paged SSD KV cache

Two-tier cache: hot blocks in RAM, cold blocks on SSD in safetensors. Prefixes survive cache eviction AND server restarts. Recomputed from scratch only when truly novel.

Continuous batching

mlx-lm BatchGenerator under the hood. M3 Ultra 512GB hits 190 tok/s at 8x concurrency (3.36x speedup over single-request).

OpenAI + Anthropic API

Both surfaces native. Web dashboard generates the exact config command for Claude Code, OpenClaw, Cursor, Codex, Hermes, Copilot.

Multi-model serving

LLMs, VLMs, embeddings, rerankers loaded simultaneously. LRU eviction + per-model TTL + manual pin/unload from the admin panel.

Tool calling + MCP

JSON / Qwen / Gemma / GLM / MiniMax tool formats auto-detected. Native MCP integration. Tool-result trimming for oversized outputs.

Native menu-bar app

PyObjC menu-bar app (not Electron). Signed, notarized, in-app auto-update. Persistent serving stats across restarts.

Best for
  • Agentic coding on a Mac: Claude Code, OpenClaw, Cursor with sub-5s TTFT after the first turn
  • Anyone with a Mac running long agent sessions where context shifts often
  • Multi-model serving (LLM + VLM + embedding + reranker simultaneously)
  • Drop-in replacement for cloud APIs while staying private
Caveats
  • macOS Sequoia (15+) and Apple Silicon only; no Linux or Intel Mac path
  • MLX-format models only; bring your own from huggingface.co/mlx-community
  • Reuses LM Studio model directories if you already have them, but does NOT auto-import from llama.cpp / GGUF

Links.

$ ls -1 ./external
↗ homepagehttps://omlx.ai↗ githubhttps://github.com/jundot/omlx↗ docshttps://github.com/jundot/omlx#readme

Compare to…

$ ./vrambudget --compare-runtimes

Discussion.

$ gh discussion list

// sign in with github to leave a comment. threads live in the repo's discussions tab.