Manufacturer: NVIDIA

RTX 4080 16GB

16GB. At 8K context with a 15% safety margin, about 12GB is left for weights: enough for 27B at Q3_K_M, 20B at Q4_K_M, or 9B at Q8_0.

VRAM: 16GB
Bandwidth: 717GB/s
FP16 compute: 390TFLOPS
Budget @ ctx 8K: 12GB

Tuned to this card.

$ ./vrambudget --gpu rtx-4080
$ vrambudget --gpu rtx-4080 --ctx 8192 --conc 1 --safety 15%
RTX 4060 · ada · 8GB
RTX 4060 Ti 16GB · ada · 16GB
RTX 4070 · ada · 12GB
RTX 4070 Super · ada · 12GB
RTX 4070 Ti Super · ada · 16GB
RTX 4080 · ada · 16GB
RTX 4080 Super · ada · 16GB
RTX 4090 · ada · flagship · 24GB
$ budget allocation
device capacity: 16GB at ctx 8K
kv cache: 0.05GB (0.3% of total)
overhead: 1.4GB (8.8% of total)
weights: 12GB (76% of total)
used: 14 / 16 GB, with the remainder held back as the safety margin
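
The allocation above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the tool's actual code: the function name `budget_breakdown` and its parameters are hypothetical, and the inputs (16GB device, 0.05GB KV cache, 1.40GB overhead, 15% safety) are the figures this page reports for ctx 8K and concurrency 1.

```python
def budget_breakdown(vram_gb, kv_gb, overhead_gb, safety_frac):
    """Split device VRAM into weights, KV cache, overhead, and safety slices.

    Hypothetical helper: mirrors budget = vram - safety - kv - overhead.
    Returns {slice_name: (gigabytes, percent_of_device_capacity)}.
    """
    safety_gb = vram_gb * safety_frac
    weights_gb = vram_gb - safety_gb - kv_gb - overhead_gb
    parts = {
        "weights": weights_gb,
        "kv_cache": kv_gb,
        "overhead": overhead_gb,
        "safety": safety_gb,
    }
    return {name: (gb, 100 * gb / vram_gb) for name, gb in parts.items()}


# Figures taken from this page (RTX 4080, ctx 8K, conc 1, 15% safety).
breakdown = budget_breakdown(vram_gb=16, kv_gb=0.05, overhead_gb=1.40, safety_frac=0.15)
for name, (gb, pct) in breakdown.items():
    print(f"{name:<9} {gb:6.2f} GB  {pct:5.1f}% of total")
```

The weights slice is whatever remains after the other three are reserved, which is why a larger context or a wider safety margin shrinks the set of models that fit.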
↳ sorted by best fit

Quant options per model: FP16/BF16, FP8/INT8, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, AWQ 4-bit, GPTQ 4-bit

fits · comfortably runs on this budget · 16 models
Weights at the selected quant, in listed order: 12 GB, 11 GB, 10 GB, 12 GB, 10 GB, 12 GB (Phi-4, 14.7B), 9.6 GB, 9.6 GB, 8.5 GB, 8.5 GB, 7.7 GB, 7.4 GB, 8.0 GB, 7.6 GB, 6.4 GB, 2.5 GB

over · needs a bigger card, more aggressive quant, or model split · 14 models
Weights at the selected quant, in listed order: 17 GB, 18 GB, 18 GB, 19 GB, 20 GB, 26 GB, 40 GB, 41 GB, 59 GB, 66 GB, 79 GB, 228 GB, 377 GB, 377 GB

Models that fit on an RTX 4080.

$ grep "fits" models.json | head -12
Model · Params · Best quant · Weights / 12 GB budget · Fit
Qwen 3.6 27B · 27B · Q3_K_M · 12 GB · fits
// weights Q3_K_M for Qwen 3.6 27B (27B params)
weights = params × bits ÷ 8
        = 27 × 3.44 ÷ 8
        = 11.61 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
11.61 ≤ 12.15  → FITS
headroom  = 0.54 GB of weights budget left
Gemma 4 26B A4B · 26B · Q3_K_M · 11 GB · fits
// weights Q3_K_M for Gemma 4 26B A4B (26B params)
weights = params × bits ÷ 8
        = 26 × 3.44 ÷ 8
        = 11.18 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
11.18 ≤ 12.15  → FITS
headroom  = 0.97 GB of weights budget left
Mistral Small 3 · 24B · Q3_K_M · 10 GB · fits
// weights Q3_K_M for Mistral Small 3 (24B params)
weights = params × bits ÷ 8
        = 24 × 3.44 ÷ 8
        = 10.32 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
10.32 ≤ 12.15  → FITS
headroom  = 1.83 GB of weights budget left
gpt-oss 20B · 20.9B · Q4_K_M · 12 GB · fits
// weights Q4_K_M for gpt-oss 20B (20.9B params)
weights = params × bits ÷ 8
        = 20.9 × 4.5 ÷ 8
        = 11.76 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
11.76 ≤ 12.15  → FITS
headroom  = 0.39 GB of weights budget left
StarCoder2 15B · 15B · Q5_K_M · 10 GB · fits
// weights Q5_K_M for StarCoder2 15B (15B params)
weights = params × bits ÷ 8
        = 15 × 5.5 ÷ 8
        = 10.31 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
10.31 ≤ 12.15  → FITS
headroom  = 1.84 GB of weights budget left
Phi-4 · 14.7B · Q5_K_M · 10 GB · fits
// weights Q5_K_M for Phi-4 (14.7B params)
weights = params × bits ÷ 8
        = 14.7 × 5.5 ÷ 8
        = 10.11 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
10.11 ≤ 12.15  → FITS
headroom  = 2.04 GB of weights budget left
Qwen 3.5 9B · 9B · Q8_0 · 9.6 GB · fits
// weights Q8_0 for Qwen 3.5 9B (9B params)
weights = params × bits ÷ 8
        = 9 × 8.5 ÷ 8
        = 9.56 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
9.56 ≤ 12.15  → FITS
headroom  = 2.59 GB of weights budget left
Gemma 2 9B · 9B · Q8_0 · 9.6 GB · fits
// weights Q8_0 for Gemma 2 9B (9B params)
weights = params × bits ÷ 8
        = 9 × 8.5 ÷ 8
        = 9.56 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
9.56 ≤ 12.15  → FITS
headroom  = 2.59 GB of weights budget left
Llama 3.1 8B · 8B · Q8_0 · 8.5 GB · fits
// weights Q8_0 for Llama 3.1 8B (8B params)
weights = params × bits ÷ 8
        = 8 × 8.5 ÷ 8
        = 8.50 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
8.50 ≤ 12.15  → FITS
headroom  = 3.65 GB of weights budget left
Granite 8B Code · 8B · Q8_0 · 8.5 GB · fits
// weights Q8_0 for Granite 8B Code (8B params)
weights = params × bits ÷ 8
        = 8 × 8.5 ÷ 8
        = 8.50 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
8.50 ≤ 12.15  → FITS
headroom  = 3.65 GB of weights budget left
Mistral 7B v0.3 · 7.2B · Q8_0 · 7.7 GB · fits
// weights Q8_0 for Mistral 7B v0.3 (7.2B params)
weights = params × bits ÷ 8
        = 7.2 × 8.5 ÷ 8
        = 7.65 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
7.65 ≤ 12.15  → FITS
headroom  = 4.50 GB of weights budget left
Qwen 2.5 7B · 7B · Q8_0 · 7.4 GB · fits
// weights Q8_0 for Qwen 2.5 7B (7B params)
weights = params × bits ÷ 8
        = 7 × 8.5 ÷ 8
        = 7.44 GB

// budget on RTX 4080 (16GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.40 GB    (runtime, cuda, allocator)
safety    = 2.40 GB    (15% of 16GB)
budget    = vram − safety − kv − overhead
          = 16 − 2.40 − 0.05 − 1.40
          = 12.15 GB

// fit decision
7.44 ≤ 12.15  → FITS
headroom  = 4.71 GB of weights budget left
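
The per-model math above follows one pattern: estimate weights as params × bits ÷ 8, compare against the weights budget, and report the headroom. The sketch below applies that pattern to pick the widest quant that fits. It is an illustration under stated assumptions, not the tool's implementation: the names `BITS_PER_WEIGHT`, `weights_budget_gb`, and `best_quant` are hypothetical, and the bit widths are only the four that appear in the worked examples on this page (Q6_K, AWQ, GPTQ and the others are left out because their widths are not stated here).

```python
# Bits per weight exactly as used in the worked examples on this page.
# Assumption: quants whose widths the page does not state are omitted.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "Q3_K_M": 3.44}


def weights_gb(params_b, bits):
    """Weights size in GB for a model with params_b billion parameters."""
    return params_b * bits / 8


def weights_budget_gb(vram_gb=16.0, kv_gb=0.05, overhead_gb=1.40, safety_frac=0.15):
    """budget = vram - safety - kv - overhead (defaults: RTX 4080 at ctx 8K)."""
    return vram_gb - vram_gb * safety_frac - kv_gb - overhead_gb


def best_quant(params_b, budget_gb):
    """Return (quant, weights_gb, headroom_gb) for the widest quant that fits, or None."""
    widest_first = sorted(BITS_PER_WEIGHT.items(), key=lambda item: -item[1])
    for quant, bits in widest_first:
        size = weights_gb(params_b, bits)
        if size <= budget_gb:
            return quant, size, budget_gb - size
    return None  # over budget even at the narrowest listed quant


budget = weights_budget_gb()
for name, params_b in [("Qwen 3.6 27B", 27), ("gpt-oss 20B", 20.9), ("Phi-4", 14.7)]:
    result = best_quant(params_b, budget)
    if result is None:
        print(f"{name}: over")
    else:
        quant, size, headroom = result
        print(f"{name}: {quant}, {size:.2f} GB, headroom {headroom:.2f} GB")
```

With only these four widths the picker agrees with the Best quant column for the rows it is run on; adding more quant levels would need their bit widths from the tool itself.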

