honest math for local llms · open source · MIT

What LLM fits my hardware?

$ answer ↓ the math, in one line

VRAM ≈ params × (bits ÷ 8)

Honest VRAM math for local LLMs. KV cache, activations, framework overhead, and concurrency are what actually crash your runs. This site makes that visible.


42 GPU presets · 30 curated models · 5 hosting runtimes

Plug in your hardware. See what actually fits.

$ ./vrambudget --interactive

$ vrambudget --gpu rtx-4090 --ctx 8192 --conc 1 --safety 15%

ada · RTX 4060 · 8GB
ada · RTX 4060 Ti 16GB · 16GB
ada · RTX 4070 · 12GB
ada · RTX 4070 Super · 12GB
ada · RTX 4070 Ti Super · 16GB
ada · RTX 4080 · 16GB
ada · RTX 4080 Super · 16GB
ada · flagship · RTX 4090 · 24GB

VRAM · 24GB
System RAM · 64GB
Context · 8K tok
Concurrency · 1 req
Safety headroom · 15%

01 Device capacity · 24GB
02 KV cache · 0.05GB · 0.2% of total
03 Runtime overhead · 1.6GB · 6.7% of total
04 Weights budget · 19GB · 78% of total

$ budget allocation · 20 / 24 GB used

weights · kv cache · overhead · safety

↳ sorted by best fit

fits · comfortably runs on this budget · 21 models

quantizations: FP16/BF16 · FP8/INT8 · Q8_0 · Q6_K · Q5_K_M · Q4_K_M · Q3_K_M · AWQ 4-bit · GPTQ 4-bit

Qwen 3.6 35B A3B · 35B · 19 GB · fits
(unnamed) · 18 GB · fits
Qwen 2.5 32B · 32.5B · 18 GB · fits
Qwen 2.5 Coder 32B · 32.5B · 18 GB · fits
Qwen3 30B A3B · 30.5B · 17 GB · fits
Qwen 3.6 27B · 27B · 19 GB · fits
Gemma 4 26B A4B · 26B · 18 GB · fits
Mistral Small 3 · 24B · 17 GB · fits
gpt-oss 20B · 20.9B · 17 GB · fits
StarCoder2 15B · 15B · 16 GB · fits
Phi-4 · 14.7B · 16 GB · fits
(unnamed) · 18 GB · fits
(unnamed) · 18 GB · fits
(unnamed) · 16 GB · fits
Granite 8B Code · 8B · 16 GB · fits
Mistral 7B v0.3 · 7.2B · 14 GB · fits
(unnamed) · 14 GB · fits
(unnamed) · 8.0 GB · fits
(unnamed) · 7.6 GB · fits
Llama 3.2 3B · 3.21B · 6.4 GB · fits
Llama 3.2 1B · 1.23B · 2.5 GB · fits

over · needs a bigger card, more aggressive quant, or model split · 9 models

Mixtral 8x7B · 46.7B · 26 GB · over
Llama 3.3 70B · 70.6B · 40 GB · over
Qwen 2.5 72B · 72.7B · 41 GB · over
(unnamed) · 59 GB · over
gpt-oss 120B · 117B · 66 GB · over
Mixtral 8x22B · 141B · 79 GB · over
Llama 3.1 405B · 405B · 228 GB · over
DeepSeek V3 · 671B · 377 GB · over
DeepSeek R1 · 671B · 377 GB · over

The math, plainly.

$ cat the-math.md | head -20

01 / weights

Weight size is the floor, not the ceiling.

Every parameter occupies a fixed number of bits. Multiply the parameter count (in billions) by bits per weight, divide by eight, and you have gigabytes. Quantization is just a smaller multiplier. A 70B model at Q4_K_M is not magic; it is 70 × 4.5 ÷ 8.

// gigabytes of weights
weights = params × bits ÷ 8

// llama-3.1-70b @ Q4_K_M
weights = 70 × 4.5 ÷ 8
        = 39.4 GB
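
The same arithmetic as a minimal Python sketch. The function name `weights_gb` is illustrative and not part of the vrambudget tool; the 4.5 bits per weight for Q4_K_M is the figure used above.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: params (billions) x bits / 8."""
    if params_billion < 0 or bits_per_weight <= 0:
        raise ValueError("params must be >= 0 and bits per weight must be > 0")
    return params_billion * bits_per_weight / 8


# The example above: a 70B model at Q4_K_M (4.5 bits per weight).
print(f"{weights_gb(70, 4.5):.1f} GB")  # 39.4 GB

# The same model at 16 bits per weight (FP16/BF16).
print(f"{weights_gb(70, 16):.1f} GB")   # 140.0 GB
```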

02 / kv cache

The KV cache is what eats your context window.

Every token in the context, prompt or generated, writes a key and a value into every attention layer. Context length and concurrency both multiply it. This is why your 24GB card runs a 32B model at Q4 fine until you set ctx=32K and watch it OOM mid-generation.

// per-request KV bytes
// (kv_heads < attention heads under GQA)
kv = 2 × layers × kv_heads ×
     head_dim × ctx × bytes_per_element

// total at runtime
total = kv × concurrent_requests
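
A rough Python sketch of the formula above. The function name and the model shape are assumptions for illustration: 80 layers, 8 KV heads, and head_dim 128 are the commonly cited values for a Llama-3.1-70B-style model with grouped-query attention, and 2 bytes per element assumes an FP16 cache. For GQA models the head count that matters is the number of KV heads, not attention heads.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    ctx: int,
    bytes_per_elem: int = 2,       # assumed FP16/BF16 cache
    concurrent_requests: int = 1,
) -> int:
    """KV cache size in bytes: keys + values, per layer, per token, per request."""
    per_request = 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem
    return per_request * concurrent_requests


GB = 1e9  # decimal gigabytes, matching the weights math

# Assumed shape (illustrative): 80 layers, 8 KV heads, head_dim 128.
for ctx in (8192, 32768):
    size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, ctx=ctx)
    print(f"ctx={ctx}: {size / GB:.1f} GB")
# ctx=8192: 2.7 GB
# ctx=32768: 10.7 GB
```

Under these assumptions, quadrupling the context quadruples the cache, and four concurrent requests at 8K cost the same as one request at 32K.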

03 / overhead

Framework overhead is a constant tax.

CUDA context, kernel workspaces, allocator slack, paged-attention buffers. Roughly 1–2 GB on a cold load, plus 3–5% of the device. It is not optional. It is not optimizable. Budget for it.

// rule of thumb
overhead = 1.5 GB + total_vram × 0.04

// what's left for weights
budget = total × (1 − safety)
       − kv − overhead
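
A small Python sketch of the budget step, assuming the rule of thumb above. The function names are hypothetical, and the optional `overhead_override` parameter is an addition for this sketch so a measured overhead can be plugged in instead of the estimate.

```python
from typing import Optional


def overhead_gb(total_vram_gb: float) -> float:
    """Rule-of-thumb framework overhead: 1.5 GB plus 4% of the device."""
    return 1.5 + total_vram_gb * 0.04


def weights_budget_gb(
    total_vram_gb: float,
    safety: float,
    kv_gb: float,
    overhead_override: Optional[float] = None,
) -> float:
    """What is left for weights after safety headroom, KV cache, and overhead."""
    if not 0 <= safety < 1:
        raise ValueError("safety must be a fraction in [0, 1)")
    overhead = overhead_gb(total_vram_gb) if overhead_override is None else overhead_override
    return total_vram_gb * (1 - safety) - kv_gb - overhead


# 24 GB card, 15% safety headroom, 0.05 GB of KV cache.
print(f"overhead: {overhead_gb(24):.2f} GB")                            # 2.46 GB
print(f"weights budget: {weights_budget_gb(24, 0.15, 0.05):.2f} GB")    # 17.89 GB
```

Passing the 1.6 GB overhead shown in the calculator panel as `overhead_override` gives 18.75 GB, consistent with the 19GB weights budget displayed there; the rule of thumb is the more conservative of the two.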


why this exists

We run an MSP in Austin. Every week a client asks the same question: "can my machine run this?" Every benchmark site answers it with a vibes-based fit indicator and an affiliate link to a 4090. That's not a budget. That's marketing.

So we built the thing we wished existed: the math, then the models. No ranking algorithm. No SEO sludge. No surprise OOMs at 3am. Plug in your hardware, see the numbers, decide for yourself.

// built by Titanium Computing · Austin, TX · Frontier Operations