The math, plainly.
Most "can my machine run this LLM" tools list models and call it done. They show a green checkmark next to a 70B model on a 24GB card and hope you don't ask why. This is the math they skipped.
Every parameter in a transformer occupies a fixed number of bits in memory. Multiply the parameter count by the bits per parameter. Divide by eight to get bytes. With the count in billions, the result is gigabytes. That's the entire foundation. Every other line item (KV cache, activations, framework overhead, allocator slack) is just adding to a number you already had.
The rest of this article walks the additions. Skip to the section that's costing you the most.
01 Weight size: the floor
A 70-billion-parameter model loaded in 16-bit precision weighs 70 × 2 = 140 GB. Loaded in 8-bit it weighs 70. In 4-bit, 35. The math doesn't care about the architecture, the training data, or the marketing copy. It cares about two numbers: how many parameters, and how many bits each one occupies.
Quantization formats like Q4_K_M or AWQ are not exactly 4 bits per parameter. They carry scales and other metadata, so the rule of thumb (params × bits ÷ 8) undershoots: in the 70B figures below, by about 6% at Q8_0 and about 13% at Q4_K_M. Use it for napkin math, round up, and you'll be fine.
70B model, weights only:
Q8_0 → 74.4 GB
Q5_K_M → 48.1 GB
Q4_K_M → 39.4 GB
Q3_K_M → 30.6 GB
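The rule of thumb and the figures above can be reproduced in a few lines. This is a sketch, not the site's calculator: `weight_gb` is a hypothetical helper, and the effective bits-per-weight values (nominal bits plus 0.5 for metadata) are an assumption, chosen because they happen to reproduce the four sizes listed for a 70B model.

```python
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes): params x bits / 8."""
    return params_billions * bits_per_param / 8


# Napkin math for a 70B model at nominal bit widths.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gb(70, bits):.1f} GB")

# Assumed effective bits per weight (nominal + 0.5 for scales and other
# metadata). Illustrative values only, not official format specifications.
EFFECTIVE_BITS = {"Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "Q3_K_M": 3.5}

for name, bits in EFFECTIVE_BITS.items():
    print(f"{name}: {weight_gb(70, bits):.1f} GB")
```

Real quantized files vary by model and by tool version, so treat the effective-bits table as a starting guess and check it against the actual file size on disk.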
02 KV cache: the context tax
Every token your model processes, prompt or generated, writes a key and a value into every attention layer. These accumulate. At ctx=2K nobody notices. At ctx=32K with concurrency 4, the KV cache can eat more VRAM than the weights.
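The exact formula depends on the architecture: per token, it is 2 (one key, one value) × layers × KV heads × head dimension × bytes per element. Modern open models use GQA, which shares each key/value head across several query heads and typically shrinks the KV cache by 4–8× compared with full multi-head attention. The practical version is: budget 1–4 GB of KV per concurrent request at long context, on a 70B-class model. More for larger models, less for smaller.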
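To make the per-architecture formula concrete, here is a minimal sketch. The function name `kv_cache_gb` is hypothetical, and the shape used (80 layers, 64 query heads, 8 KV heads, head dimension 128, 16-bit cache) is an assumed Llama-3-70B-style configuration, not something this article specifies.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache for ONE request, in GB (1 GB = 1e9 bytes).

    The leading 2 counts one key and one value vector per token,
    per layer, per KV head.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * bytes_per_elem * ctx_tokens)
    return total_bytes / 1e9


# Assumed 70B-class shape with GQA: 8 KV heads shared by 64 query heads.
with_gqa = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, ctx_tokens=8192)

# Same shape without GQA: every one of the 64 query heads keeps its own K/V.
without_gqa = kv_cache_gb(n_layers=80, n_kv_heads=64, head_dim=128, ctx_tokens=8192)

print(f"GQA:    {with_gqa:.2f} GB per request at 8K context")
print(f"No GQA: {without_gqa:.2f} GB per request at 8K context")
```

Context length enters linearly, so quadrupling the context quadruples the result. A quantized KV cache (for example 8-bit) would halve `bytes_per_elem`.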
This is the single most under-counted line item in every "fit check" tool on the internet. If your tool doesn't ask you for context length, throw it out.
03 Framework overhead: the constant
CUDA context, kernel workspaces, paged-attention buffers, allocator fragmentation, the runtime itself. None of this shows up in the model card. All of it is real.
Rule of thumb: 1.5 GB fixed, plus 3–5% of total device VRAM, depending on backend. vLLM, llama.cpp, TGI, and exllamav2 each have their own constants. We use a generous middle estimate.
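The rule of thumb above is simple enough to write down directly. This is a sketch under the stated rule only: `framework_overhead_gb` is a hypothetical helper, and the percentage is left as an input because the text gives a 3–5% range rather than a per-backend constant.

```python
def framework_overhead_gb(total_vram_gb: float, pct: float,
                          fixed_gb: float = 1.5) -> float:
    """Fixed runtime cost plus a share of total device VRAM, in GB."""
    return fixed_gb + pct * total_vram_gb


# The low, middle, and high ends of the 3-5% range on a 24 GB card.
for pct in (0.03, 0.04, 0.05):
    print(f"{pct:.0%}: {framework_overhead_gb(24, pct):.2f} GB")
```

The spread between the low and high ends on a 24 GB card is under half a gigabyte, which is why a single middle estimate is usually good enough for a fit check.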
$ THE VRAM TAX NOBODY TALKS ABOUT
By the time you've paid the runtime, the KV cache, and a sane safety margin, you have considerably less than the 24 GB printed on the box. Here is the bill on a stock RTX 4090 at ctx=8K, concurrency 1, 15% safety:
- -3.6 GB safety headroom (15% of 24)
- -2.2 GB framework overhead (1.5 + 3% of 24)
- -0.8 GB kv cache (ctx 8K, 1 req)
- 17.4 GB ← what's actually left for weights
04 Concurrency: the multiplier
Every concurrent request gets its own KV cache. Two requests, double the KV. Eight requests, eight times. This is why production inference servers run smaller models than your local llama.cpp setup — they need the headroom for batching. See continuous batching for how vLLM and oMLX claw it back.
If you're serving one request at a time, ignore this. If you're building anything that fans out, this is the only number that matters.
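Turning the multiplier around gives the number of requests a card can hold at once. The sketch below assumes a static worst case in which every request reserves its full KV cache up front. `max_concurrency` is a hypothetical helper, and the inputs (18.2 GB usable, a 4.5 GB model, 0.8 GB of KV per request) are illustrative numbers, not measurements.

```python
import math


def max_concurrency(usable_vram_gb: float, weights_gb: float,
                    kv_per_request_gb: float) -> int:
    """How many requests fit if each one reserves its own full KV cache.

    usable_vram_gb is what remains after safety margin and framework
    overhead, but before weights and KV cache.
    """
    spare = usable_vram_gb - weights_gb
    if spare < 0:
        return 0  # the weights alone do not fit
    return math.floor(spare / kv_per_request_gb)


# Illustrative: a small model leaves room for many requests...
print(max_concurrency(usable_vram_gb=18.2, weights_gb=4.5, kv_per_request_gb=0.8))

# ...while a 39.4 GB model does not fit at all on the same budget.
print(max_concurrency(usable_vram_gb=18.2, weights_gb=39.4, kv_per_request_gb=0.8))
```

Servers that allocate KV memory on demand can fit more requests than this bound suggests when most of them use short contexts, so read the result as a conservative floor.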
05 The full equation
total_vram × (1 − safety)
− kv_cache × concurrency
− framework_overhead
= VRAM left for weights
That's the calculator on the homepage. It's not magic. It's not a model. It's three subtractions and two multiplications. Anyone who tells you it needs to be more complicated is selling you something.
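For readers who prefer code to arithmetic, the equation can be sketched as a pair of functions. The names `vram_left_for_weights` and `fits` are hypothetical and this is not the site's actual implementation. The inputs are taken from the RTX 4090 bill earlier in the article (24 GB, 15% safety, 0.8 GB KV, concurrency 1, 2.2 GB overhead).

```python
def vram_left_for_weights(total_vram_gb: float, safety: float,
                          kv_cache_gb: float, concurrency: int,
                          framework_overhead_gb: float) -> float:
    """The full equation: what remains for weights, in GB."""
    return (total_vram_gb * (1 - safety)
            - kv_cache_gb * concurrency
            - framework_overhead_gb)


def fits(weights_gb: float, **budget) -> bool:
    """True if the weights fit in what the budget leaves over."""
    return weights_gb <= vram_left_for_weights(**budget)


# Inputs from the RTX 4090 bill in this article.
rtx_4090 = dict(total_vram_gb=24, safety=0.15, kv_cache_gb=0.8,
                concurrency=1, framework_overhead_gb=2.2)

left = vram_left_for_weights(**rtx_4090)
print(f"Left for weights: {left:.1f} GB")

# A 70B model at Q4_K_M (39.4 GB) against the same budget.
print("70B Q4_K_M fits:", fits(39.4, **rtx_4090))
```

Keeping the budget as plain keyword arguments makes it easy to rerun the check with a different context length or concurrency without touching the function.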
06 What this site doesn't do
It does not benchmark tokens-per-second. It does not predict accuracy degradation from quantization. It does not tell you which model to use. There are good tools for those things. This one answers exactly one question: will the weights, the cache, and the runtime fit in your physical RAM. That's it. That's the whole product.
// new to the terms? Every dotted word above links to a full definition on /glossary. Open it in another tab and skim the index; the whole vocabulary of local inference is roughly 23 terms.