High-throughput inference engine from UC Berkeley Sky Lab, now driven by 2000+ contributors. PagedAttention manages KV memory in blocks for efficient sharing and reuse. Supports 200+ HuggingFace model architectures across NVIDIA, AMD, Intel Gaudi, TPUs, Apple Silicon, and CPU. The default choice when you need to serve many requests concurrently and the answer to "what powers production AI infra."
Supported platforms: NVIDIA GPUs, AMD GPUs, Apple Silicon, Intel Gaudi, Google TPUs, x86/ARM/PowerPC CPUs
Block-based KV cache management for efficient memory sharing. Cuts wasted VRAM under high concurrency.
Incoming requests join the running batch immediately. Up to 23x throughput vs naive serving (Anyscale benchmarks).
Llama, Qwen, Gemma, Mixtral, DeepSeek, GPT-OSS, Pixtral, Qwen-VL, E5-Mistral, ColBERT, more.
NVIDIA, AMD, Intel Gaudi, Google TPUs, Apple Silicon, IBM Spyre, Huawei Ascend, Rebellions NPU.
n-gram, suffix, EAGLE, DFlash. Lower latency without sacrificing quality.
Serve multiple LoRA adapters concurrently against one base model. Critical for personalized inference at scale.
// sign in with github to leave a comment. threads live in the repo's discussions tab.