Manufacturer: NVIDIA

RTX 5070 12GB

12GB Blackwell. Faster than 4070; same VRAM ceiling.

VRAM:             12 GB
Bandwidth:        672 GB/s
FP16 compute:     380 TFLOPS
Budget @ ctx 8K:  8.3 GB

Tuned to this card.

$ ./vrambudget --gpu rtx-5070
$ vrambudget --gpu rtx-5070 --ctx 8192 --conc 1 --safety 15%
RTX 5070 · 12GB · blackwell
RTX 5070 Ti · 16GB · blackwell
RTX 5080 · 16GB · blackwell
RTX 5090 · 32GB · blackwell · flagship
$ budget allocation · 10 / 12 GB used
device capacity   12 GB
weights           8.8 GB    74% of total
kv cache          0.05 GB   0.4% of total
overhead          1.3 GB    10.8% of total
safety            1.8 GB    15% of total
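
The split above can be reproduced from the formula the page uses in its worked examples, budget = vram − safety − kv − overhead. The following is a minimal sketch under that assumption; the function names `weights_budget_gb` and `shares` are hypothetical and are not part of the vrambudget tool.

```python
def weights_budget_gb(vram_gb, kv_gb, overhead_gb, safety_frac):
    """Return (budget_gb, safety_gb): what is left for weights after reservations."""
    safety_gb = vram_gb * safety_frac          # e.g. 15% of device capacity
    budget_gb = vram_gb - safety_gb - kv_gb - overhead_gb
    return budget_gb, safety_gb


def shares(vram_gb, parts):
    """Express each allocation as a percentage of total device capacity."""
    return {name: 100 * gb / vram_gb for name, gb in parts.items()}


# Values taken from the page: 12 GB card, ctx 8K, conc 1, 15% safety.
budget, safety = weights_budget_gb(12.0, kv_gb=0.05, overhead_gb=1.30, safety_frac=0.15)
parts = {"weights": budget, "kv cache": 0.05, "overhead": 1.30, "safety": safety}

for name, pct in shares(12.0, parts).items():
    print(f"{name:9s} {parts[name]:5.2f} GB  {pct:5.1f}%")
```

The weights share works out to about 73.75% of 12 GB, which the page rounds to 74%. Weights, kv cache and overhead together come to about 10.2 GB, matching the "10 / 12 GB used" readout.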
↳ sorted by best fit
quant options: FP16/BF16 · FP8/INT8 · Q8_0 · Q6_K · Q5_K_M · Q4_K_M · Q3_K_M · AWQ 4-bit · GPTQ 4-bit

fits · comfortably runs on this budget · 12 models
8.4 GB · 8.3 GB (Phi-4, 14.7B) · 7.4 GB · 7.4 GB · 8.5 GB · 8.5 GB · 7.7 GB · 7.4 GB · 8.0 GB · 7.6 GB · 6.4 GB · 2.5 GB

over · needs a bigger card, more aggressive quant, or model split · 18 models
12 GB · 14 GB · 15 GB · 15 GB · 17 GB · 18 GB · 18 GB · 19 GB · 20 GB · 26 GB · 40 GB · 41 GB · 59 GB · 66 GB · 79 GB · 228 GB · 377 GB · 377 GB

Models that fit on an RTX 5070.

$ grep "fits" models.json | head -12
Model | Params | Best quant | Weights / 8.3 GB budget | Fit
StarCoder2 15B | 15B | AWQ 4-bit | 8.0 GB | fits
// weights AWQ 4-bit for StarCoder2 15B (15B params)
weights = params × bits ÷ 8
        = 15 × 4.25 ÷ 8
        = 7.97 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
7.97 ≤ 8.85  → FITS
headroom  = 0.88 GB of weights budget left
Phi-4 | 14.7B | Q4_K_M | 8.3 GB | fits
// weights Q4_K_M for Phi-4 (14.7B params)
weights = params × bits ÷ 8
        = 14.7 × 4.5 ÷ 8
        = 8.27 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
8.27 ≤ 8.85  → FITS
headroom  = 0.58 GB of weights budget left
Qwen 3.5 9B | 9B | Q6_K | 7.4 GB | fits
// weights Q6_K for Qwen 3.5 9B (9B params)
weights = params × bits ÷ 8
        = 9 × 6.56 ÷ 8
        = 7.38 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
7.38 ≤ 8.85  → FITS
headroom  = 1.47 GB of weights budget left
Gemma 2 9B | 9B | Q6_K | 7.4 GB | fits
// weights Q6_K for Gemma 2 9B (9B params)
weights = params × bits ÷ 8
        = 9 × 6.56 ÷ 8
        = 7.38 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
7.38 ≤ 8.85  → FITS
headroom  = 1.47 GB of weights budget left
Llama 3.1 8B | 8B | FP8/INT8 | 8.0 GB | fits
// weights FP8/INT8 for Llama 3.1 8B (8B params)
weights = params × bits ÷ 8
        = 8 × 8 ÷ 8
        = 8.00 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
8.00 ≤ 8.85  → FITS
headroom  = 0.85 GB of weights budget left
Granite 8B Code | 8B | FP8/INT8 | 8.0 GB | fits
// weights FP8/INT8 for Granite 8B Code (8B params)
weights = params × bits ÷ 8
        = 8 × 8 ÷ 8
        = 8.00 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
8.00 ≤ 8.85  → FITS
headroom  = 0.85 GB of weights budget left
Mistral 7B v0.3 | 7.2B | Q8_0 | 7.7 GB | fits
// weights Q8_0 for Mistral 7B v0.3 (7.2B params)
weights = params × bits ÷ 8
        = 7.2 × 8.5 ÷ 8
        = 7.65 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
7.65 ≤ 8.85  → FITS
headroom  = 1.20 GB of weights budget left
Qwen 2.5 7B | 7B | Q8_0 | 7.4 GB | fits
// weights Q8_0 for Qwen 2.5 7B (7B params)
weights = params × bits ÷ 8
        = 7 × 8.5 ÷ 8
        = 7.44 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
7.44 ≤ 8.85  → FITS
headroom  = 1.41 GB of weights budget left
Gemma 4 E4B | 4B | FP16/BF16 | 8.0 GB | fits
// weights FP16/BF16 for Gemma 4 E4B (4B params)
weights = params × bits ÷ 8
        = 4 × 16 ÷ 8
        = 8.00 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
8.00 ≤ 8.85  → FITS
headroom  = 0.85 GB of weights budget left
Phi-4 Mini | 3.8B | FP16/BF16 | 7.6 GB | fits
// weights FP16/BF16 for Phi-4 Mini (3.8B params)
weights = params × bits ÷ 8
        = 3.8 × 16 ÷ 8
        = 7.60 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
7.60 ≤ 8.85  → FITS
headroom  = 1.25 GB of weights budget left
Llama 3.2 3B | 3.21B | FP16/BF16 | 6.4 GB | fits
// weights FP16/BF16 for Llama 3.2 3B (3.21B params)
weights = params × bits ÷ 8
        = 3.21 × 16 ÷ 8
        = 6.42 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
6.42 ≤ 8.85  → FITS
headroom  = 2.43 GB of weights budget left
Llama 3.2 1B | 1.23B | FP16/BF16 | 2.5 GB | fits
// weights FP16/BF16 for Llama 3.2 1B (1.23B params)
weights = params × bits ÷ 8
        = 1.23 × 16 ÷ 8
        = 2.46 GB

// budget on RTX 5070 (12GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.30 GB    (runtime, cuda, allocator)
safety    = 1.80 GB    (15% of 12GB)
budget    = vram − safety − kv − overhead
          = 12 − 1.80 − 0.05 − 1.30
          = 8.85 GB

// fit decision
2.46 ≤ 8.85  → FITS
headroom  = 6.39 GB of weights budget left
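
The "Best quant" column can be approximated with a few lines of Python. This is a sketch, not the tool's actual selection logic: it assumes the rule is "pick the highest bits-per-weight format whose weights fit the budget", it uses the 8.3 GB figure from the table header as that budget, and it only includes the bits-per-weight values that appear in the worked examples above. `BITS_PER_WEIGHT`, `weights_gb` and `best_quant` are hypothetical names.

```python
# Bits per weight as they appear in the page's worked examples.
# Q5_K_M, Q3_K_M and GPTQ 4-bit are omitted because the page never states their values.
BITS_PER_WEIGHT = {
    "FP16/BF16": 16.0,
    "Q8_0": 8.5,
    "FP8/INT8": 8.0,
    "Q6_K": 6.56,
    "Q4_K_M": 4.5,
    "AWQ 4-bit": 4.25,
}


def weights_gb(params_b, bits):
    """weights = params x bits / 8, with params in billions giving GB."""
    return params_b * bits / 8


def best_quant(params_b, budget_gb):
    """Assumed rule: highest bits-per-weight format whose weights fit the budget."""
    for name, bits in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1]):
        size = weights_gb(params_b, bits)
        if size <= budget_gb:
            return name, size
    return None  # nothing fits: bigger card, lower-bit quant, or model split


for model, params in [("StarCoder2 15B", 15), ("Phi-4", 14.7),
                      ("Llama 3.1 8B", 8), ("Llama 3.2 1B", 1.23)]:
    print(model, best_quant(params, budget_gb=8.3))
```

Under these assumptions the rule picks the same format as the table for every listed model. With the 8.85 GB budget from the math blocks instead, it would choose Q4_K_M for a 15B model (15 × 4.5 ÷ 8 ≈ 8.44 GB), so the result is sensitive to which budget figure is used.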
