@meta
  v: 1
  route: /runtime/vllm
  generated: 2026-06-10T09:09:52.499Z

@intent
  purpose:    Install and configure vLLM, then serve LLMs with it.
  audience:   self-hoster, ai-engineer, mac-user, devops-engineer
  capability: install, serve_local_llms, compare_runtimes, open_external_docs

@state
  slug: vllm
  name: vLLM
  family: server
  type: server
  license: Apache 2.0
  primary_platform: NVIDIA CUDA (server)
  platforms[6]: NVIDIA GPUs, AMD GPUs, Apple Silicon, Intel Gaudi, Google TPUs, x86/ARM/PowerPC CPUs
  model_formats[6]: safetensors, GGUF, AWQ, GPTQ, FP8, compressed-tensors
  api_compatibility[3]: OpenAI, Anthropic Messages, gRPC
  install_command: pip install vllm
  install_secondary: docker pull vllm/vllm-openai
  homepage_url: https://docs.vllm.ai/en/latest/
  github_url: https://github.com/vllm-project/vllm
  docs_url: https://docs.vllm.ai/en/latest/
  feature_count: 6
  feature_labels[6]: PagedAttention, Continuous batching, 200+ models, Multi-hardware, Speculative decoding, Multi-LoRA
  best_for[4]: Production inference at scale (multiple concurrent users), Distributed inference: tensor / pipeline / data / expert / context parallelism, OpenAI-compatible drop-in for replacing GPT calls with self-hosted models, Any workload serving thousands of requests per minute
  caveats[3]: Not a single-binary CLI; setup is heavier than Ollama or LM Studio, Mostly CUDA-first; non-NVIDIA backends still maturing in spots, Overkill for single-user local desktop usage

@actions
  - id: open_homepage
    method: GET
    href: https://docs.vllm.ai/en/latest/
  - id: open_github
    method: GET
    href: https://github.com/vllm-project/vllm
  - id: open_docs
    method: GET
    href: https://docs.vllm.ai/en/latest/
  - id: view_index
    method: GET
    href: /runtime/
  - id: view_calculator
    method: GET
    href: /#calculator

@context
  > High-throughput inference engine from UC Berkeley's Sky Computing Lab, now developed by 2000+ contributors. PagedAttention manages KV-cache memory in fixed-size blocks so it can be shared and reused efficiently across requests. Supports 200+ Hugging Face model architectures across NVIDIA, AMD, Intel Gaudi, TPUs, Apple Silicon, and CPU. The default choice when you need to serve many requests concurrently, and a common engine behind production AI infrastructure.
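  > The block-based KV memory idea can be illustrated with a toy sketch. This is not vLLM's implementation or API: `ToyBlockPool`, `BLOCK_SIZE`, and `blocks_needed` are hypothetical names, and the block size is chosen only to keep the example small. It shows two requests sharing the blocks of a common prompt prefix through reference counts, so a block returns to the free list only when no request uses it.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per block; illustrative value, not a vLLM default


@dataclass
class ToyBlockPool:
    """Hypothetical pool of fixed-size KV blocks with reference counting."""
    num_blocks: int
    free: list = field(init=False)
    refcount: dict = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))
        self.refcount = {}

    def allocate(self) -> int:
        """Take one block from the free list for exclusive use."""
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        """Let another request reuse an already-filled block."""
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        """Drop one reference; free the block when nobody holds it."""
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


def blocks_needed(num_tokens: int) -> int:
    """Ceiling division: blocks required to hold num_tokens."""
    return -(-num_tokens // BLOCK_SIZE)


pool = ToyBlockPool(num_blocks=8)

# Request A: an 8-token prompt occupies 2 blocks.
table_a = [pool.allocate() for _ in range(blocks_needed(8))]

# Request B has the same prompt prefix: it shares A's blocks
# and allocates one block of its own for newly generated tokens.
table_b = [pool.share(b) for b in table_a] + [pool.allocate()]
blocks_in_use = pool.num_blocks - len(pool.free)

# A finishes; the shared blocks stay allocated because B still holds them.
for b in table_a:
    pool.release(b)
still_in_use = pool.num_blocks - len(pool.free)

# B finishes; every block returns to the free list.
for b in table_b:
    pool.release(b)
```

  > In this sketch the two requests together occupy three blocks rather than five, which is the kind of saving that block-level sharing is meant to provide.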

@nav
  self:      /runtime/vllm
  parents:   [/, /runtime/]
  peers:     [/runtime/ollama, /runtime/lm-studio, /runtime/mlx, /runtime/omlx]
  drilldown: /#calculator