@meta
  v: 1
  route: /glossary
  generated: 2026-06-10T07:03:33.039Z

@intent
  purpose:    Explain the technical terms used in local LLM memory budgeting. Educational reference for the calculator and the math page.
  audience:   ai-newcomer, ai-engineer, self-hoster, student
  capability: define_term, jump_to_anchor, follow_related

@state
  total_terms: 23
  terms[23]{slug,term,one_liner,related,anchor}:
    vram,VRAM,Video RAM. The memory on a GPU. The hard ceiling for what a model can load.,"kv-cache,framework-overhead,safety-headroom",/glossary#vram
    parameters,Parameters,The numbers the model learned. Reported in billions (B). The first factor in VRAM ≈ params × bits ÷ 8.,"moe,quantization,context-window",/glossary#parameters
    quantization,Quantization,Storing each parameter in fewer bits. The single biggest lever for fitting a big model on small hardware.,"parameters,vram,awq,gguf",/glossary#quantization
    kv-cache,KV cache,Cached attention keys and values for past tokens. Grows with context length and concurrency. The silent VRAM killer.,"context-window,concurrency,gqa,mqa,paged-attention",/glossary#kv-cache
    gqa,GQA (Grouped Query Attention),Many query heads share a smaller set of key/value heads. Shrinks the KV cache by the ratio of query heads to KV heads (e.g. 4x with 32 query heads and 8 KV heads).,"kv-cache,mqa,parameters",/glossary#gqa
    mqa,MQA (Multi-Query Attention),"A single shared K/V head across all query heads. Even smaller KV cache than GQA; PaLM and Falcon-7B use it.","gqa,kv-cache",/glossary#mqa
    moe,MoE (Mixture of Experts),"Sparse models with many experts; only a few activate per token. Total params for memory, active params for compute.","parameters,expert-parallel",/glossary#moe
    context-window,Context window,How many tokens the model can attend to at once. Limits conversation length and document size.,"kv-cache,concurrency,parameters",/glossary#context-window
    concurrency,Concurrency,Parallel requests on one model. Each request gets its own KV cache; total memory scales linearly.,"kv-cache,paged-attention,continuous-batching",/glossary#concurrency
    framework-overhead,Framework overhead,"CUDA context, kernel workspaces, allocator slack. 1.5-2.5 GB you cannot get back.","vram,safety-headroom",/glossary#framework-overhead
    safety-headroom,Safety headroom,"The fraction of VRAM you refuse to spend. Buffers fragmentation, spikes, and surprises.","vram,framework-overhead",/glossary#safety-headroom
    paged-attention,PagedAttention,Block-based KV cache management. vLLM's trick for reducing wasted memory under high concurrency.,"kv-cache,continuous-batching,concurrency",/glossary#paged-attention
    continuous-batching,Continuous batching,New requests join the running batch immediately instead of waiting for it to finish. Benchmarks have reported up to 23x throughput versus naive static batching.,"paged-attention,concurrency",/glossary#continuous-batching
    speculative-decoding,Speculative decoding,"A small draft model proposes tokens; the big model verifies in parallel. Faster decoding, same output.",continuous-batching,/glossary#speculative-decoding
    tensor-parallel,Tensor parallelism,Split each layer's matrix multiplication across multiple GPUs. Standard way to run models bigger than one card.,"pipeline-parallel,expert-parallel",/glossary#tensor-parallel
    pipeline-parallel,Pipeline parallelism,Split layers across GPUs; requests flow through like an assembly line.,"tensor-parallel,expert-parallel",/glossary#pipeline-parallel
    expert-parallel,Expert parallelism,For MoE models: shard experts across GPUs. Each GPU holds a subset; routing decides where each token goes.,"moe,tensor-parallel",/glossary#expert-parallel
    activations,Activations,The transient values inside the forward pass. Counted separately from weights and KV cache.,"flash-attention,framework-overhead",/glossary#activations
    flash-attention,Flash Attention,"An attention kernel that avoids materializing the full attention matrix, making it faster and more memory-efficient. Standard in most modern runtimes.","activations,kv-cache",/glossary#flash-attention
    tokens-per-second,Tokens/sec (TPS),"How fast the model generates output. Decoding is usually memory-bandwidth-bound at small batch sizes and shifts toward compute-bound as batch size grows.","ttft,kv-cache",/glossary#tokens-per-second
    ttft,TTFT (Time To First Token),How long from request to first generated token. Dominated by prefill on long prompts.,"tokens-per-second,paged-attention,kv-cache",/glossary#ttft
    awq,AWQ / GPTQ,GPU-optimized 4-bit post-training quantization methods. AWQ is activation-aware; GPTQ is a one-shot method that quantizes layer by layer.,"quantization,gguf",/glossary#awq
    gguf,GGUF,The model file format the llama.cpp ecosystem uses. Carries the weights + tokenizer + metadata in one file.,"quantization,awq",/glossary#gguf

@actions
  - id: jump_to_vram
    method: GET
    href: /glossary#vram
  - id: jump_to_parameters
    method: GET
    href: /glossary#parameters
  - id: jump_to_quantization
    method: GET
    href: /glossary#quantization
  - id: jump_to_kv_cache
    method: GET
    href: /glossary#kv-cache
  - id: jump_to_gqa
    method: GET
    href: /glossary#gqa
  - id: jump_to_mqa
    method: GET
    href: /glossary#mqa
  - id: view_math
    method: GET
    href: /the-math
  - id: view_calculator
    method: GET
    href: /#calculator

@context
  > A single-page glossary covering every term used in the VRAM budgeting math. Designed for readers who have read about local LLM inference and want plain-English definitions of KV cache, GQA, MoE, paged attention, quantization formats (Q4_K_M, AWQ, GPTQ, GGUF), context window, concurrency, framework overhead, safety headroom, tensor / pipeline / expert parallelism, flash attention, speculative decoding, tokens/sec, and TTFT.

@nav
  self:      /glossary
  parents:   [/]
  peers:     [/the-math, /calc]
  drilldown: /the-math
