~/runtime/vllm
vLLM

vLLM

Production-grade LLM serving. PagedAttention, continuous batching, OpenAI-compatible API.

License
Apache 2.0
Platform
NVIDIA CUDA (server)
Model formats
safetensors · GGUF · AWQ · GPTQ · FP8 · compressed-tensors
API
OpenAI · Anthropic Messages · gRPC

What it is.

$ ./vrambudget --runtime vllm

High-throughput inference engine from UC Berkeley Sky Lab, now driven by 2000+ contributors. PagedAttention manages KV memory in blocks for efficient sharing and reuse. Supports 200+ HuggingFace model architectures across NVIDIA, AMD, Intel Gaudi, TPUs, Apple Silicon, and CPU. The default choice when you need to serve many requests concurrently and the answer to "what powers production AI infra."

Install.

$ pkg install vllm
pip install vllm
docker pull vllm/vllm-openai

Supported platforms: NVIDIA GPUs, AMD GPUs, Apple Silicon, Intel Gaudi, Google TPUs, x86/ARM/PowerPC CPUs

Features.

$ cat features.md
PagedAttention

Block-based KV cache management for efficient memory sharing. Cuts wasted VRAM under high concurrency.

Continuous batching

Incoming requests join the running batch immediately. Up to 23x throughput vs naive serving (Anyscale benchmarks).

200+ models

Llama, Qwen, Gemma, Mixtral, DeepSeek, GPT-OSS, Pixtral, Qwen-VL, E5-Mistral, ColBERT, more.

Multi-hardware

NVIDIA, AMD, Intel Gaudi, Google TPUs, Apple Silicon, IBM Spyre, Huawei Ascend, Rebellions NPU.

Speculative decoding

n-gram, suffix, EAGLE, DFlash. Lower latency without sacrificing quality.

Multi-LoRA

Serve multiple LoRA adapters concurrently against one base model. Critical for personalized inference at scale.

Best for
  • Production inference at scale (multiple concurrent users)
  • Distributed inference: tensor / pipeline / data / expert / context parallelism
  • OpenAI-compatible drop-in for replacing GPT calls with self-hosted
  • Anything serving thousands of requests per minute
Caveats
  • Not a single-binary CLI; setup is heavier than Ollama or LM Studio
  • Mostly CUDA-first; non-NVIDIA backends still maturing in spots
  • Overkill for single-user local desktop usage

Links.

$ ls -1 ./external
↗ homepagehttps://docs.vllm.ai/en/latest/↗ githubhttps://github.com/vllm-project/vllm↗ docshttps://docs.vllm.ai/en/latest/

Compare to…

$ ./vrambudget --compare-runtimes

Discussion.

$ gh discussion list

// sign in with github to leave a comment. threads live in the repo's discussions tab.