VRAM ≈ params × (bits ÷ 8)
Honest VRAM math for local LLMs. KV cache, activations, framework overhead, and concurrency are what actually crash your runs. This site makes that visible.
Every parameter occupies a fixed number of bits. Multiply, divide by eight, and you have gigabytes. Quantization is just a smaller multiplier. A 70B model at Q4_K_M is not magic; it is 70 × 4.5 ÷ 8.
// gigabytes of weights weights = params × bits ÷ 8 // llama-3.1-70b @ Q4_K_M weights = 70 × 4.5 ÷ 8 = 39.4 GB
Every token you generate writes a key and a value into every attention layer. Context length and concurrency both multiply it. This is why your 24GB card runs Llama 70B Q4 fine until you set ctx=32K and watch it OOM mid-generation.
// per-request KV bytes kv = 2 × layers × heads × head_dim × ctx × bytes // total at runtime total = kv × concurrent_requests
CUDA context, kernel workspaces, allocator slack, paged-attention buffers. Roughly 1–2 GB on a cold load, plus 3–5% of the device. It is not optional. It is not optimizable. Budget for it.
// rule of thumb overhead = 1.5 GB + total_vram × 0.04 // what's left for weights budget = total × (1 − safety) − kv − overhead
We run an MSP in Austin. Every week a client asks the same question: "can my machine run this?" Every benchmark site answers it with a vibes-based fit indicator and an affiliate link to a 4090. That's not a budget. That's marketing.
So we built the thing we wished existed: the math, then the models. No ranking algorithm. No SEO sludge. No surprise OOMs at 3am. Plug in your hardware, see the numbers, decide for yourself.