honest math for local llms · open source · MIT

What LLM fits my hardware?

$ answer ↓ the math, in one line

VRAM params × (bits ÷ 8)

Honest VRAM math for local LLMs. KV cache, activations, framework overhead, and concurrency are what actually crash your runs. This site makes that visible.

42 GPU presets · 30 curated models · 5 hosting runtimes

Plug in your hardware. See what actually fits.

$ ./vrambudget --interactive
$ vrambudget --gpu rtx-4090 --ctx 8192 --conc 1 --safety 15%
ada · RTX 4060 · 8GB
ada · RTX 4060 Ti 16GB · 16GB
ada · RTX 4070 · 12GB
ada · RTX 4070 Super · 12GB
ada · RTX 4070 Ti Super · 16GB
ada · RTX 4080 · 16GB
ada · RTX 4080 Super · 16GB
ada · flagship · RTX 4090 · 24GB
24GB · 64GB · 8K tok
24GB device capacity
0.05GB · 0.2% of total
1.6GB · 6.7% of total
19GB · 78% of total
$ budget allocation · 20 / 24 GB used
weights · kv cache · overhead · safety
↳ sorted by best fit
quant options: FP16/BF16 · FP8/INT8 · Q8_0 · Q6_K · Q5_K_M · Q4_K_M · Q3_K_M · AWQ 4-bit · GPTQ 4-bit

fits · comfortably runs on this budget · 21 models
19 GB · 18 GB · 18 GB · 18 GB · 17 GB · 19 GB · 18 GB · 17 GB · 17 GB · 16 GB · 16 GB (Phi-4 14.7B) · 18 GB · 18 GB · 16 GB · 16 GB · 14 GB · 14 GB · 8.0 GB · 7.6 GB · 6.4 GB · 2.5 GB

over · needs a bigger card, more aggressive quant, or model split · 9 models
26 GB · 40 GB · 41 GB · 59 GB · 66 GB · 79 GB · 228 GB · 377 GB · 377 GB

The math, plainly.

$ cat the-math.md | head -20
01 / weights

Weight size is the floor, not the ceiling.

Every parameter occupies a fixed number of bits. Multiply the parameter count (in billions) by the bits per weight, divide by eight, and you have gigabytes. Quantization is just a smaller multiplier. A 70B model at Q4_K_M is not magic; it is 70 × 4.5 ÷ 8.

// gigabytes of weights
weights = params × bits ÷ 8

// llama-3.1-70b @ Q4_K_M
weights = 70 × 4.5 ÷ 8
        = 39.4 GB
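
As a quick check of the arithmetic above, here is a minimal Python sketch. The helper name `weights_gb` and the small bits-per-weight table are illustrative assumptions, not part of any tool; real quantized files vary by a few percent depending on which tensors get which format.

```python
# Hypothetical helper: weight footprint in GB from parameter count and bits per weight.
# The bits-per-weight values are rough averages assumed for illustration only.
APPROX_BITS = {"FP16/BF16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.5}


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """params (in billions) x bits / 8 = gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in APPROX_BITS.items():
        print(f"70B @ {name}: {weights_gb(70, bits):.1f} GB")
```

Working in billions of parameters keeps the result in gigabytes directly, which is why no unit conversion appears in the function.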
02 / kv cache

The KV cache is what eats your context window.

Every token in the context, prompt or generated, stores a key and a value in every attention layer. Context length and concurrency both multiply it. This is why your 48GB card runs Llama 70B Q4 fine until you set ctx=32K and watch it OOM mid-generation.

// per-request KV bytes
// (kv_heads = key/value heads; fewer than attention heads under GQA)
kv = 2 × layers × kv_heads ×
     head_dim × ctx × bytes

// total at runtime
total = kv × concurrent_requests
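
To make the scaling concrete, here is a rough Python sketch of the same formula. The function name `kv_cache_gb` is hypothetical. It takes the number of key/value heads, which for grouped-query attention models is smaller than the attention head count. The example figures (80 layers, 8 KV heads, head dimension 128, 2-byte cache entries) are assumed from the published Llama 3.1 70B configuration; check your own model's config before relying on them.

```python
def kv_cache_gb(
    layers: int,
    kv_heads: int,
    head_dim: int,
    ctx: int,
    bytes_per_elem: int = 2,       # assumption: FP16/BF16 cache entries
    concurrent_requests: int = 1,
) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for a given context and concurrency."""
    # The factor 2 covers one key and one value per token, per layer, per KV head.
    per_request_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem
    return per_request_bytes * concurrent_requests / 1e9


if __name__ == "__main__":
    # Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads, head_dim 128.
    for ctx in (8192, 32768):
        print(f"ctx={ctx}: {kv_cache_gb(80, 8, 128, ctx):.2f} GB per request")
```

Quadrupling the context quadruples the cache, and each additional concurrent request adds a full copy, which is the multiplication the section describes.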
03 / overhead

Framework overhead is a constant tax.

CUDA context, kernel workspaces, allocator slack, paged-attention buffers. Roughly 1–2 GB on a cold load, plus 3–5% of the device. It is not optional. It is not optimizable. Budget for it.

// rule of thumb
overhead = 1.5 GB + total_vram × 0.04

// what's left for weights
budget = total_vram × (1 − safety)
       − kv × concurrent_requests − overhead
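
The rule of thumb and the budget line can be combined into a small Python sketch. The function names here are hypothetical, the 1.5 GB and 4% constants are taken from the rule of thumb above, and the 15% safety margin is only an assumed default; real overhead varies by runtime and driver.

```python
def overhead_gb(total_vram_gb: float) -> float:
    """Rule-of-thumb framework overhead: 1.5 GB fixed plus 4% of the device."""
    return 1.5 + total_vram_gb * 0.04


def weight_budget_gb(total_vram_gb: float, kv_total_gb: float, safety: float = 0.15) -> float:
    """VRAM left for weights after the safety margin, total KV cache, and overhead."""
    return total_vram_gb * (1 - safety) - kv_total_gb - overhead_gb(total_vram_gb)


def fits(weights_gb: float, total_vram_gb: float, kv_total_gb: float, safety: float = 0.15) -> bool:
    """True when the weights fit inside the remaining budget."""
    return weights_gb <= weight_budget_gb(total_vram_gb, kv_total_gb, safety)


if __name__ == "__main__":
    budget = weight_budget_gb(24, kv_total_gb=1.0)
    print(f"24 GB card, 1 GB of KV cache: {budget:.2f} GB left for weights")
    print("70B @ 39.4 GB fits:", fits(39.4, 24, 1.0))
```

Here `kv_total_gb` is the per-request KV size already multiplied by the number of concurrent requests, so concurrency is accounted for before the subtraction.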

Read the full explainer →

We run an MSP in Austin. Every week a client asks the same question: "can my machine run this?" Every benchmark site answers it with a vibes-based fit indicator and an affiliate link to a 4090. That's not a budget. That's marketing.

So we built the thing we wished existed: the math, then the models. No ranking algorithm. No SEO sludge. No surprise OOMs at 3am. Plug in your hardware, see the numbers, decide for yourself.

// built by Titanium Computing · Austin, TX · Frontier Operations