Manufacturer: NVIDIA

RTX 4090 24GB

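The consumer flagship. With an 8K context, one concurrent request, and a 15% safety margin, about 18.75 GB remains for weights. That fits 30B-class models at Q3_K_M to Q4_K_M, 24B to 26B models at Q5_K_M, and roughly 15B models at Q8_0. Llama-3.1 70B at Q4_K_M and Mixtral 8x7B at Q5 both exceed 24 GB in weights alone, so they need a bigger card or a model split.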

VRAM: 24 GB
Bandwidth: 1,008 GB/s
FP16 compute: 330 TFLOPS
Budget @ ctx 8K: 18 GB

Tuned to this card.

$ ./vrambudget --gpu rtx-4090
$ vrambudget --gpu rtx-4090 --ctx 8192 --conc 1 --safety 15%
ada · RTX 4060 · 8 GB
ada · RTX 4060 Ti 16GB · 16 GB
ada · RTX 4070 · 12 GB
ada · RTX 4070 Super · 12 GB
ada · RTX 4070 Ti Super · 16 GB
ada · RTX 4080 · 16 GB
ada · RTX 4080 Super · 16 GB
ada · flagship · RTX 4090 · 24 GB
$ budget allocation · 20 / 24 GB used
device capacity · 24 GB · ctx 8K tok
weights · 19 GB · 78% of total
kv cache · 0.05 GB · 0.2% of total
overhead · 1.6 GB · 6.7% of total
safety · 3.6 GB · 15% of total
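
The allocation above can be reproduced with a few lines of arithmetic. This is a minimal sketch in Python, not the tool's actual implementation: the function name `vram_budget` and its parameters are hypothetical, and the inputs (24 GB card, 0.05 GB KV cache, 1.6 GB overhead, 15% safety) are the figures this page reports for an 8K context at concurrency 1.

```python
def vram_budget(vram_gb: float, kv_cache_gb: float,
                overhead_gb: float, safety_frac: float) -> dict:
    """Split total VRAM (GB) into weights, KV cache, overhead, and safety.

    Hypothetical helper: whatever is left after the safety margin,
    KV cache, and runtime overhead is treated as the weights budget.
    """
    safety_gb = vram_gb * safety_frac
    weights_gb = vram_gb - safety_gb - kv_cache_gb - overhead_gb
    return {
        "weights": weights_gb,
        "kv_cache": kv_cache_gb,
        "overhead": overhead_gb,
        "safety": safety_gb,
    }


VRAM_GB = 24.0  # RTX 4090
parts = vram_budget(VRAM_GB, kv_cache_gb=0.05, overhead_gb=1.60, safety_frac=0.15)

for name, gb in parts.items():
    share = 100.0 * gb / VRAM_GB
    print(f"{name:9s} {gb:6.2f} GB  {share:5.1f}% of total")
```

With these inputs the weights share comes out at 18.75 GB, roughly 78% of the card, which lines up with the breakdown shown here. Changing `safety_frac` or `kv_cache_gb` shows how quickly a longer context or a larger margin eats into the weights budget.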
↳ sorted by best fit
fits · comfortably runs on this budget · 21 models
quant options: FP16/BF16 · FP8/INT8 · Q8_0 · Q6_K · Q5_K_M · Q4_K_M · Q3_K_M · AWQ 4-bit · GPTQ 4-bit
19 GB · 18 GB · 18 GB · 18 GB · 17 GB · 19 GB · 18 GB · 17 GB · 17 GB · 16 GB · 16 GB (Phi-4, 14.7B) · 18 GB · 18 GB · 16 GB · 16 GB · 14 GB · 14 GB · 8.0 GB · 7.6 GB · 6.4 GB · 2.5 GB
over · needs a bigger card, more aggressive quant, or model split · 9 models
26 GB · 40 GB · 41 GB · 59 GB · 66 GB · 79 GB · 228 GB · 377 GB · 377 GB
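
When a model lands in the over bucket, it can help to see how far off it is. One way is to invert the page's weights = params × bits ÷ 8 rule to get the largest parameter count each quant can hold inside the weights budget. The sketch below assumes the 18.75 GB weights budget this page derives for the card and the bits-per-weight values used in its worked examples. `max_params_b` is a hypothetical helper, and the result is a rough ceiling on weights only, not the tool's own selection logic.

```python
# Bits per weight, copied from the worked examples on this page.
BITS_PER_WEIGHT = {
    "FP16/BF16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.5,
    "AWQ 4-bit": 4.25,
    "Q3_K_M": 3.44,
}

BUDGET_GB = 18.75  # weights budget on a 24 GB card at ctx 8K, conc 1, 15% safety


def max_params_b(budget_gb: float, bits: float) -> float:
    """Largest parameter count (in billions) whose weights fit in budget_gb."""
    return budget_gb * 8.0 / bits


for quant, bits in BITS_PER_WEIGHT.items():
    print(f"{quant:10s} up to ~{max_params_b(BUDGET_GB, bits):.1f}B params")
```

Under these assumptions even Q3_K_M tops out at roughly 43.6B parameters, which is why 70B-class models need a bigger card or a split across devices rather than a more aggressive quant from this list.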

Models that fit on an RTX 4090.

$ grep "fits" models.json | head -12
Model · Params · Best quant · Weights / 18 GB budget · Fit
Qwen 3.6 35B A3B · 35B · Q3_K_M · 15 GB · fits
// weights Q3_K_M for Qwen 3.6 35B A3B (35B params)
weights = params × bits ÷ 8
        = 35 × 3.44 ÷ 8
        = 15.05 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
15.05 ≤ 18.75  → FITS
headroom  = 3.70 GB of weights budget left
Yi 34B · 34B · Q3_K_M · 15 GB · fits
// weights Q3_K_M for Yi 34B (34B params)
weights = params × bits ÷ 8
        = 34 × 3.44 ÷ 8
        = 14.62 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
14.62 ≤ 18.75  → FITS
headroom  = 4.13 GB of weights budget left
Qwen 2.5 32B · 32.5B · AWQ 4-bit · 17 GB · fits
// weights AWQ 4-bit for Qwen 2.5 32B (32.5B params)
weights = params × bits ÷ 8
        = 32.5 × 4.25 ÷ 8
        = 17.27 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
17.27 ≤ 18.75  → FITS
headroom  = 1.48 GB of weights budget left
Qwen 2.5 Coder 32B · 32.5B · AWQ 4-bit · 17 GB · fits
// weights AWQ 4-bit for Qwen 2.5 Coder 32B (32.5B params)
weights = params × bits ÷ 8
        = 32.5 × 4.25 ÷ 8
        = 17.27 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
17.27 ≤ 18.75  → FITS
headroom  = 1.48 GB of weights budget left
Qwen3 30B A3B · 30.5B · Q4_K_M · 17 GB · fits
// weights Q4_K_M for Qwen3 30B A3B (30.5B params)
weights = params × bits ÷ 8
        = 30.5 × 4.5 ÷ 8
        = 17.16 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
17.16 ≤ 18.75  → FITS
headroom  = 1.59 GB of weights budget left
Qwen 3.6 27B · 27B · Q4_K_M · 15 GB · fits
// weights Q4_K_M for Qwen 3.6 27B (27B params)
weights = params × bits ÷ 8
        = 27 × 4.5 ÷ 8
        = 15.19 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
15.19 ≤ 18.75  → FITS
headroom  = 3.56 GB of weights budget left
Gemma 4 26B A4B · 26B · Q5_K_M · 18 GB · fits
// weights Q5_K_M for Gemma 4 26B A4B (26B params)
weights = params × bits ÷ 8
        = 26 × 5.5 ÷ 8
        = 17.88 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
17.88 ≤ 18.75  → FITS
headroom  = 0.87 GB of weights budget left
Mistral Small 3 · 24B · Q5_K_M · 17 GB · fits
// weights Q5_K_M for Mistral Small 3 (24B params)
weights = params × bits ÷ 8
        = 24 × 5.5 ÷ 8
        = 16.50 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
16.50 ≤ 18.75  → FITS
headroom  = 2.25 GB of weights budget left
gpt-oss 20B · 20.9B · Q6_K · 17 GB · fits
// weights Q6_K for gpt-oss 20B (20.9B params)
weights = params × bits ÷ 8
        = 20.9 × 6.56 ÷ 8
        = 17.14 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
17.14 ≤ 18.75  → FITS
headroom  = 1.61 GB of weights budget left
StarCoder2 15B · 15B · Q8_0 · 16 GB · fits
// weights Q8_0 for StarCoder2 15B (15B params)
weights = params × bits ÷ 8
        = 15 × 8.5 ÷ 8
        = 15.94 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
15.94 ≤ 18.75  → FITS
headroom  = 2.81 GB of weights budget left
Phi-4 · 14.7B · Q8_0 · 16 GB · fits
// weights Q8_0 for Phi-4 (14.7B params)
weights = params × bits ÷ 8
        = 14.7 × 8.5 ÷ 8
        = 15.62 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
15.62 ≤ 18.75  → FITS
headroom  = 3.13 GB of weights budget left
Qwen 3.5 9B · 9B · FP16/BF16 · 18 GB · fits
// weights FP16/BF16 for Qwen 3.5 9B (9B params)
weights = params × bits ÷ 8
        = 9 × 16 ÷ 8
        = 18.00 GB

// budget on RTX 4090 (24GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 1.60 GB    (runtime, cuda, allocator)
safety    = 3.60 GB    (15% of 24GB)
budget    = vram − safety − kv − overhead
          = 24 − 3.60 − 0.05 − 1.60
          = 18.75 GB

// fit decision
18.00 ≤ 18.75  → FITS
headroom  = 0.75 GB of weights budget left
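
Each worked example in the table above follows the same steps: compute the weights size, compare it to the weights budget, and report the headroom. The following is an illustrative Python sketch under the page's stated assumptions (18.75 GB weights budget at ctx 8K, conc 1, 15% safety). The names `weights_gb` and `check_fit` are hypothetical and are not part of the vrambudget tool.

```python
def weights_gb(params_b: float, bits: float) -> float:
    """Weights size in GB: params (billions) x bits per weight / 8."""
    return params_b * bits / 8.0


def check_fit(params_b: float, bits: float, budget_gb: float = 18.75):
    """Return (fits, headroom_gb) for a model at a given quantization."""
    size = weights_gb(params_b, bits)
    return size <= budget_gb, budget_gb - size


# (model, params in billions, quant, bits per weight) taken from the rows above,
# plus one extra case showing a quant that no longer fits.
cases = [
    ("Yi 34B", 34.0, "Q3_K_M", 3.44),
    ("Gemma 4 26B A4B", 26.0, "Q5_K_M", 5.5),
    ("Qwen 3.5 9B", 9.0, "FP16/BF16", 16.0),
    ("Yi 34B", 34.0, "Q4_K_M", 4.5),
]

for name, params_b, quant, bits in cases:
    fits, headroom = check_fit(params_b, bits)
    verdict = "FITS" if fits else "OVER"
    print(f"{name} @ {quant}: {weights_gb(params_b, bits):.2f} GB "
          f"-> {verdict} (headroom {headroom:+.2f} GB)")
```

The last case shows why the table lists Yi 34B at Q3_K_M: at Q4_K_M the weights alone come to about 19.1 GB, slightly over the 18.75 GB budget, so the headroom goes negative.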
