Manufacturer: Apple

M5 Max 128 (128GB)

128GB unified memory, 614 GB/s, 40-core GPU. M5 flagship; up to 4x AI compute over M4 Max. Runs 70B at Q8_0 with headroom for long context; 70B at FP16 (about 140 GB of weights) exceeds the budget.

VRAM: 128GB
Bandwidth: 614 GB/s
FP16 compute: 55 TFLOPS
Budget @ ctx 8K: 103GB

Tuned to this card.

$ ./vrambudget --gpu m5-max-128
$ vrambudget --gpu m5-max-128 --ctx 8192 --conc 1 --safety 15%
apple | M2 Max 64 | 64GB
apple | M2 Ultra 192 | 192GB
apple | M3 Max 64 | 64GB
apple | M3 Max 96 | 96GB
apple | M4 Pro 64 | 64GB
apple | M4 Max 128 | 128GB
apple · monster | M3 Ultra 512 | 512GB
apple · neural | M5 Pro 64 | 64GB
apple · flagship | M5 Max 128 | 128GB
$ budget allocation
109 / 128 GB used (ctx 8K tok)
device capacity | 128GB
weights | 106GB | 83% of total
kv cache | 0.05GB | 0.0% of total
overhead | 2.5GB | 2.0% of total
safety | 19.2GB | 15% of total
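The allocation above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not taken from vrambudget itself: the function name `allocation` and its parameters are hypothetical, and it assumes that "used" means everything except the safety reserve, which is the reading that matches the 109 / 128 GB figure.

```python
def allocation(vram_gb, kv_gb, overhead_gb, safety_frac):
    """Split device capacity into weights, KV cache, overhead and safety.

    Hypothetical helper: mirrors the budget arithmetic shown on this page,
    not the tool's actual source.
    """
    safety_gb = vram_gb * safety_frac
    # Whatever is left after the reserves is available for model weights.
    weights_gb = vram_gb - safety_gb - kv_gb - overhead_gb
    parts = {
        "weights": weights_gb,
        "kv cache": kv_gb,
        "overhead": overhead_gb,
        "safety": safety_gb,
    }
    # Share of total device capacity, in percent.
    shares = {name: 100 * gb / vram_gb for name, gb in parts.items()}
    # Assumption: "used" counts everything except the safety reserve.
    used_gb = vram_gb - safety_gb
    return parts, shares, used_gb


parts, shares, used_gb = allocation(128, 0.05, 2.5, 0.15)
for name in parts:
    print(f"{name:9s} {parts[name]:7.2f} GB  {shares[name]:5.1f}% of total")
print(f"used {used_gb:.0f} / 128 GB")
```

With the values on this page (128GB device, 0.05 GB KV cache, 2.5 GB overhead, 15% safety), the weights share comes out at roughly 83% of the device.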
↳ sorted by best fit
quant options per model: FP16/BF16 · FP8/INT8 · Q8_0 · Q6_K · Q5_K_M · Q4_K_M · Q3_K_M · AWQ 4-bit · GPTQ 4-bit

fits (27 models): comfortably runs on this budget
weights (GB): 97, 96, 104, 77, 75, 93, 70, 68, 65, 65, 61, 54, 52, 48, 42, 30, 29 (Phi-4, 14.7B), 18, 18, 16, 16, 14, 14, 8.0, 7.6, 6.4, 2.5

over (3 models): needs a bigger card, more aggressive quant, or model split
weights (GB): 228, 377, 377

Models that fit on an M5 Max 128.

$ grep "fits" models.json | head -12
Model | Params | Best quant | Weights / 103 GB budget | Fit
Mixtral 8x22B | 141B | Q5_K_M | 97 GB | fits
// weights Q5_K_M for Mixtral 8x22B (141B params)
weights = params × bits ÷ 8
        = 141 × 5.5 ÷ 8
        = 96.94 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
96.94 ≤ 106.25  → FITS
headroom  = 9.31 GB of weights budget left
gpt-oss 120B | 117B | Q6_K | 96 GB | fits
// weights Q6_K for gpt-oss 120B (117B params)
weights = params × bits ÷ 8
        = 117 × 6.56 ÷ 8
        = 95.94 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
95.94 ≤ 106.25  → FITS
headroom  = 10.31 GB of weights budget left
Command R+ | 104B | Q6_K | 85 GB | fits
// weights Q6_K for Command R+ (104B params)
weights = params × bits ÷ 8
        = 104 × 6.56 ÷ 8
        = 85.28 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
85.28 ≤ 106.25  → FITS
headroom  = 20.97 GB of weights budget left
Qwen 2.5 72B | 72.7B | Q8_0 | 77 GB | fits
// weights Q8_0 for Qwen 2.5 72B (72.7B params)
weights = params × bits ÷ 8
        = 72.7 × 8.5 ÷ 8
        = 77.24 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
77.24 ≤ 106.25  → FITS
headroom  = 29.00 GB of weights budget left
Llama 3.3 70B | 70.6B | Q8_0 | 75 GB | fits
// weights Q8_0 for Llama 3.3 70B (70.6B params)
weights = params × bits ÷ 8
        = 70.6 × 8.5 ÷ 8
        = 75.01 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
75.01 ≤ 106.25  → FITS
headroom  = 31.24 GB of weights budget left
Mixtral 8x7B | 46.7B | FP16/BF16 | 93 GB | fits
// weights FP16/BF16 for Mixtral 8x7B (46.7B params)
weights = params × bits ÷ 8
        = 46.7 × 16 ÷ 8
        = 93.40 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
93.40 ≤ 106.25  → FITS
headroom  = 12.85 GB of weights budget left
Qwen 3.6 35B A3B | 35B | FP16/BF16 | 70 GB | fits
// weights FP16/BF16 for Qwen 3.6 35B A3B (35B params)
weights = params × bits ÷ 8
        = 35 × 16 ÷ 8
        = 70.00 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
70.00 ≤ 106.25  → FITS
headroom  = 36.25 GB of weights budget left
Yi 34B | 34B | FP16/BF16 | 68 GB | fits
// weights FP16/BF16 for Yi 34B (34B params)
weights = params × bits ÷ 8
        = 34 × 16 ÷ 8
        = 68.00 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
68.00 ≤ 106.25  → FITS
headroom  = 38.25 GB of weights budget left
Qwen 2.5 32B | 32.5B | FP16/BF16 | 65 GB | fits
// weights FP16/BF16 for Qwen 2.5 32B (32.5B params)
weights = params × bits ÷ 8
        = 32.5 × 16 ÷ 8
        = 65.00 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
65.00 ≤ 106.25  → FITS
headroom  = 41.25 GB of weights budget left
Qwen 2.5 Coder 32B | 32.5B | FP16/BF16 | 65 GB | fits
// weights FP16/BF16 for Qwen 2.5 Coder 32B (32.5B params)
weights = params × bits ÷ 8
        = 32.5 × 16 ÷ 8
        = 65.00 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
65.00 ≤ 106.25  → FITS
headroom  = 41.25 GB of weights budget left
Qwen3 30B A3B | 30.5B | FP16/BF16 | 61 GB | fits
// weights FP16/BF16 for Qwen3 30B A3B (30.5B params)
weights = params × bits ÷ 8
        = 30.5 × 16 ÷ 8
        = 61.00 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
61.00 ≤ 106.25  → FITS
headroom  = 45.25 GB of weights budget left
Qwen 3.6 27B | 27B | FP16/BF16 | 54 GB | fits
// weights FP16/BF16 for Qwen 3.6 27B (27B params)
weights = params × bits ÷ 8
        = 27 × 16 ÷ 8
        = 54.00 GB

// budget on M5 Max 128 (128GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 19.20 GB    (15% of 128GB)
budget    = vram − safety − kv − overhead
          = 128 − 19.20 − 0.05 − 2.50
          = 106.25 GB

// fit decision
54.00 ≤ 106.25  → FITS
headroom  = 52.25 GB of weights budget left
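
The worked examples above all follow the same three steps: estimate weights as params × bits ÷ 8, subtract the reserves from device memory to get a budget, then compare. A minimal sketch of that procedure follows, which also picks the highest-precision format that fits. It is an illustration under stated assumptions, not vrambudget's implementation: the names `BITS`, `weights_gb`, `budget_gb` and `best_quant` are hypothetical, and the bits-per-weight table is limited to the four formats whose values appear in the math above.

```python
# Bits per weight, copied from the worked examples on this page.
# Assumption: only these four formats are considered; the other quant
# options listed on the page are left out because their bit widths
# are not shown here.
BITS = {"FP16/BF16": 16, "Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.5}


def weights_gb(params_b, bits):
    """Weight size in GB for a model with params_b billion parameters."""
    return params_b * bits / 8


def budget_gb(vram_gb=128, kv_gb=0.05, overhead_gb=2.5, safety_frac=0.15):
    """Memory left for weights after safety, KV cache and overhead."""
    return vram_gb - vram_gb * safety_frac - kv_gb - overhead_gb


def best_quant(params_b, budget):
    """Return (format, weights, headroom) for the widest format that fits,
    or None when nothing in BITS fits the budget."""
    for name, bits in sorted(BITS.items(), key=lambda item: -item[1]):
        size = weights_gb(params_b, bits)
        if size <= budget:
            return name, size, budget - size
    return None


budget = budget_gb()
for model, params_b in [("Mixtral 8x22B", 141), ("Llama 3.3 70B", 70.6),
                        ("Mixtral 8x7B", 46.7)]:
    result = best_quant(params_b, budget)
    if result is None:
        print(f"{model}: over budget")
    else:
        name, size, headroom = result
        print(f"{model}: {name}, {size:.2f} GB, headroom {headroom:.2f} GB")
```

Walking the formats from widest to narrowest is one simple way to express "best fit"; a real tool may rank formats differently, for example by measured quality rather than bit width.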
