~/gpu/h100-nvl-2x
manufacturer: nvidia

2× H100 NVL 188GB

Two H100 NVL cards bridged with NVLink, 188GB combined. The 'I want to run Llama-3.1 405B' rig.

VRAM: 188 GB
Bandwidth: 3,938 GB/s
FP16 compute: 1,979 TFLOPS
Budget @ ctx 8K: 151 GB

Tuned to this card.

$ ./vrambudget --gpu h100-nvl-2x
$ vrambudget --gpu h100-nvl-2x --ctx 8192 --conc 1 --safety 15%
188 GB device capacity
kv cache: 0.05 GB (0.0% of total)
overhead: 2.5 GB (1.3% of total)
weights: 157 GB (84% of total)
$ budget allocation: 160 / 188 GB used
legend: weights · kv cache · overhead · safety
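The allocation above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the tool's actual code: the function name `budget_breakdown` and its parameters are hypothetical, and the formula (budget = vram − safety − kv − overhead) is taken from the "show the math" lines on this page.

```python
def budget_breakdown(vram_gb, kv_gb, overhead_gb, safety_frac):
    """Split device capacity into safety, KV cache, overhead, and weights budget.

    Hypothetical helper: mirrors the page's formula
    budget = vram - safety - kv - overhead.
    """
    safety_gb = vram_gb * safety_frac
    weights_budget_gb = vram_gb - safety_gb - kv_gb - overhead_gb
    parts = {
        "safety": safety_gb,
        "kv_cache": kv_gb,
        "overhead": overhead_gb,
        "weights_budget": weights_budget_gb,
    }
    # Each entry maps to (GB, percent of total device capacity).
    return {name: (gb, 100.0 * gb / vram_gb) for name, gb in parts.items()}


# Inputs as shown on this page: 188 GB, ctx 8K, conc 1, 15% safety.
breakdown = budget_breakdown(vram_gb=188, kv_gb=0.05, overhead_gb=2.5, safety_frac=0.15)
for name, (gb, pct) in breakdown.items():
    print(f"{name:15s} {gb:7.2f} GB  {pct:5.1f}% of total")
```

With these inputs the weights budget comes out to 157.25 GB, about 84% of the device, which matches the percentages listed above.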
↳ sorted by best fit
quants: FP16/BF16 · FP8/INT8 · Q8_0 · Q6_K · Q5_K_M · Q4_K_M · Q3_K_M · AWQ 4-bit · GPTQ 4-bit

fits: comfortably runs on this budget (27 models)
150 GB · 124 GB · 111 GB · 145 GB · 141 GB · 93 GB · 70 GB · 68 GB · 65 GB · 65 GB · 61 GB · 54 GB · 52 GB · 48 GB · 42 GB · 30 GB · 29 GB (Phi-4, 14.7B) · 18 GB · 18 GB · 16 GB · 16 GB · 14 GB · 14 GB · 8.0 GB · 7.6 GB · 6.4 GB · 2.5 GB

over: needs a bigger card, more aggressive quant, or model split (3 models)
228 GB · 377 GB · 377 GB
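The fits/over split is a single comparison against the weights budget. The sketch below is an illustration rather than the tool's implementation: it uses the footprints listed on this page and the 157.25 GB budget from the page's own math, and the variable names are assumptions.

```python
# Weights budget from the page's math: 188 - 28.20 - 0.05 - 2.50.
WEIGHTS_BUDGET_GB = 157.25

# Weight footprints (GB) as listed on this page, in page order.
footprints_gb = [
    150, 124, 111, 145, 141, 93, 70, 68, 65, 65, 61, 54, 52, 48, 42, 30,
    29, 18, 18, 16, 16, 14, 14, 8.0, 7.6, 6.4, 2.5,  # listed under "fits"
    228, 377, 377,                                    # listed under "over"
]

# A model fits when its weights are no larger than the budget.
fits = [gb for gb in footprints_gb if gb <= WEIGHTS_BUDGET_GB]
over = [gb for gb in footprints_gb if gb > WEIGHTS_BUDGET_GB]

print(f"fits: {len(fits)} models, over: {len(over)} models")
# prints "fits: 27 models, over: 3 models"
```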

Models that fit on a 2× H100 NVL.

$ grep "fits" models.json | head -12
Model | Params | Best quant | Weights / 151 GB budget | Fit
Mixtral 8x22B | 141B | Q8_0 | 150 GB | fits
// weights Q8_0 for Mixtral 8x22B (141B params)
weights = params × bits ÷ 8
        = 141 × 8.5 ÷ 8
        = 149.81 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
149.81 ≤ 157.25  → FITS
headroom  = 7.44 GB of weights budget left
gpt-oss 120B | 117B | Q8_0 | 124 GB | fits
// weights Q8_0 for gpt-oss 120B (117B params)
weights = params × bits ÷ 8
        = 117 × 8.5 ÷ 8
        = 124.31 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
124.31 ≤ 157.25  → FITS
headroom  = 32.94 GB of weights budget left
Command R+ | 104B | Q8_0 | 111 GB | fits
// weights Q8_0 for Command R+ (104B params)
weights = params × bits ÷ 8
        = 104 × 8.5 ÷ 8
        = 110.50 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
110.50 ≤ 157.25  → FITS
headroom  = 46.75 GB of weights budget left
Qwen 2.5 72B | 72.7B | FP16/BF16 | 145 GB | fits
// weights FP16/BF16 for Qwen 2.5 72B (72.7B params)
weights = params × bits ÷ 8
        = 72.7 × 16 ÷ 8
        = 145.40 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
145.40 ≤ 157.25  → FITS
headroom  = 11.85 GB of weights budget left
Llama 3.3 70B | 70.6B | FP16/BF16 | 141 GB | fits
// weights FP16/BF16 for Llama 3.3 70B (70.6B params)
weights = params × bits ÷ 8
        = 70.6 × 16 ÷ 8
        = 141.20 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
141.20 ≤ 157.25  → FITS
headroom  = 16.05 GB of weights budget left
Mixtral 8x7B | 46.7B | FP16/BF16 | 93 GB | fits
// weights FP16/BF16 for Mixtral 8x7B (46.7B params)
weights = params × bits ÷ 8
        = 46.7 × 16 ÷ 8
        = 93.40 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
93.40 ≤ 157.25  → FITS
headroom  = 63.85 GB of weights budget left
Qwen 3.6 35B A3B | 35B | FP16/BF16 | 70 GB | fits
// weights FP16/BF16 for Qwen 3.6 35B A3B (35B params)
weights = params × bits ÷ 8
        = 35 × 16 ÷ 8
        = 70.00 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
70.00 ≤ 157.25  → FITS
headroom  = 87.25 GB of weights budget left
Yi 34B | 34B | FP16/BF16 | 68 GB | fits
// weights FP16/BF16 for Yi 34B (34B params)
weights = params × bits ÷ 8
        = 34 × 16 ÷ 8
        = 68.00 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
68.00 ≤ 157.25  → FITS
headroom  = 89.25 GB of weights budget left
Qwen 2.5 32B | 32.5B | FP16/BF16 | 65 GB | fits
// weights FP16/BF16 for Qwen 2.5 32B (32.5B params)
weights = params × bits ÷ 8
        = 32.5 × 16 ÷ 8
        = 65.00 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
65.00 ≤ 157.25  → FITS
headroom  = 92.25 GB of weights budget left
Qwen 2.5 Coder 32B | 32.5B | FP16/BF16 | 65 GB | fits
// weights FP16/BF16 for Qwen 2.5 Coder 32B (32.5B params)
weights = params × bits ÷ 8
        = 32.5 × 16 ÷ 8
        = 65.00 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
65.00 ≤ 157.25  → FITS
headroom  = 92.25 GB of weights budget left
Qwen3 30B A3B | 30.5B | FP16/BF16 | 61 GB | fits
// weights FP16/BF16 for Qwen3 30B A3B (30.5B params)
weights = params × bits ÷ 8
        = 30.5 × 16 ÷ 8
        = 61.00 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
61.00 ≤ 157.25  → FITS
headroom  = 96.25 GB of weights budget left
Qwen 3.6 27B | 27B | FP16/BF16 | 54 GB | fits
// weights FP16/BF16 for Qwen 3.6 27B (27B params)
weights = params × bits ÷ 8
        = 27 × 16 ÷ 8
        = 54.00 GB

// budget on 2× H100 NVL (188GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, cuda, allocator)
safety    = 28.20 GB    (15% of 188GB)
budget    = vram − safety − kv − overhead
          = 188 − 28.20 − 0.05 − 2.50
          = 157.25 GB

// fit decision
54.00 ≤ 157.25  → FITS
headroom  = 103.25 GB of weights budget left
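
The per-model math above follows one pattern: compute weights as params × bits ÷ 8, compare against the budget, and report the headroom. The sketch below wraps that in two hypothetical helpers, `weights_gb` and `best_quant`. Only the two bit widths that appear in the math on this page (16 for FP16/BF16, 8.5 for Q8_0) are included. Picking the highest-precision quant that fits is an assumption about how the "Best quant" column is chosen, not something the page states.

```python
# Bits per weight. Only the two values that appear in the math on this page;
# widths for the other quants are not given in the source, so add your own.
BITS_PER_WEIGHT = {"FP16/BF16": 16.0, "Q8_0": 8.5}

# Weights budget from the page: vram - safety - kv - overhead.
WEIGHTS_BUDGET_GB = 188 - 28.20 - 0.05 - 2.50  # 157.25 GB


def weights_gb(params_b, bits):
    """weights = params x bits / 8, with params in billions and the result in GB."""
    return params_b * bits / 8


def best_quant(params_b, budget_gb=WEIGHTS_BUDGET_GB):
    """Return (quant, weights_gb, headroom_gb) for the widest quant that fits.

    Assumption: "best" means the most bits per weight that still fits.
    Returns None when no quant in BITS_PER_WEIGHT fits the budget.
    """
    by_precision = sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1])
    for quant, bits in by_precision:
        size = weights_gb(params_b, bits)
        if size <= budget_gb:
            return quant, size, budget_gb - size
    return None


for name, params_b in [("Mixtral 8x22B", 141), ("Llama 3.3 70B", 70.6)]:
    quant, size, headroom = best_quant(params_b)
    print(f"{name}: {quant}, {size:.2f} GB, headroom {headroom:.2f} GB")
```

Under these assumptions Mixtral 8x22B lands on Q8_0 (FP16 would need 282 GB) and Llama 3.3 70B stays at FP16/BF16, in line with the rows above.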
