vLLM

~/runtime/vllm

vLLM

Production-grade LLM serving. PagedAttention, continuous batching, OpenAI-compatible API.

License

Apache 2.0

Platform

NVIDIA CUDA (server)

Model formats

safetensors · GGUF · AWQ · GPTQ · FP8 · compressed-tensors

API

OpenAI · Anthropic Messages · gRPC

What it is.

$ ./vrambudget --runtime vllm

High-throughput inference engine from UC Berkeley Sky Lab, now driven by 2000+ contributors. PagedAttention manages KV memory in blocks for efficient sharing and reuse. Supports 200+ HuggingFace model architectures across NVIDIA, AMD, Intel Gaudi, TPUs, Apple Silicon, and CPU. The default choice when you need to serve many requests concurrently and the answer to "what powers production AI infra."

Install.

$ pkg install vllm

pip install vllm

docker pull vllm/vllm-openai

Supported platforms: NVIDIA GPUs, AMD GPUs, Apple Silicon, Intel Gaudi, Google TPUs, x86/ARM/PowerPC CPUs

Features.

$ cat features.md

PagedAttention

Block-based KV cache management for efficient memory sharing. Cuts wasted VRAM under high concurrency.

Continuous batching

Incoming requests join the running batch immediately. Up to 23x throughput vs naive serving (Anyscale benchmarks).

200+ models

Llama, Qwen, Gemma, Mixtral, DeepSeek, GPT-OSS, Pixtral, Qwen-VL, E5-Mistral, ColBERT, more.

Multi-hardware

NVIDIA, AMD, Intel Gaudi, Google TPUs, Apple Silicon, IBM Spyre, Huawei Ascend, Rebellions NPU.

Speculative decoding

n-gram, suffix, EAGLE, DFlash. Lower latency without sacrificing quality.

Multi-LoRA

Serve multiple LoRA adapters concurrently against one base model. Critical for personalized inference at scale.

Best for

▸Production inference at scale (multiple concurrent users)
▸Distributed inference: tensor / pipeline / data / expert / context parallelism
▸OpenAI-compatible drop-in for replacing GPT calls with self-hosted
▸Anything serving thousands of requests per minute

Caveats

▸Not a single-binary CLI; setup is heavier than Ollama or LM Studio
▸Mostly CUDA-first; non-NVIDIA backends still maturing in spots
▸Overkill for single-user local desktop usage

What it is.

Install.

Features.

Links.

Compare to…

Discussion.