Native macOS inference server built on MLX. Solves the biggest pain in agentic local inference: when the KV cache invalidates mid-session (which it does constantly with coding agents), oMLX restores cached prefix blocks from SSD in milliseconds instead of recomputing from scratch. TTFT from 30-90s down to under 5s on the second turn. Drop-in for Claude Code, OpenClaw, Cursor. Native menu-bar app, not Electron. Apache 2.0.
Supported platforms: macOS 15+ (Apple Silicon)
Two-tier cache: hot blocks in RAM, cold blocks on SSD in safetensors. Prefixes survive cache eviction AND server restarts. Recomputed from scratch only when truly novel.
mlx-lm BatchGenerator under the hood. M3 Ultra 512GB hits 190 tok/s at 8x concurrency (3.36x speedup over single-request).
Both surfaces native. Web dashboard generates the exact config command for Claude Code, OpenClaw, Cursor, Codex, Hermes, Copilot.
LLMs, VLMs, embeddings, rerankers loaded simultaneously. LRU eviction + per-model TTL + manual pin/unload from the admin panel.
JSON / Qwen / Gemma / GLM / MiniMax tool formats auto-detected. Native MCP integration. Tool-result trimming for oversized outputs.
PyObjC menu-bar app (not Electron). Signed, notarized, in-app auto-update. Persistent serving stats across restarts.
// sign in with github to leave a comment. threads live in the repo's discussions tab.