manufacturer: apple

M3 Max 96 96GB

96GB unified. 70B at Q6 or 120B at Q4. Quiet, cool, portable.

VRAM: 96GB
Bandwidth: 400GB/s
FP16 compute: 35 TFLOPS
Budget @ ctx 8K: 76GB

Tuned to this card.

$ ./vrambudget --gpu m3-max-96
$ vrambudget --gpu m3-max-96 --ctx 8192 --conc 1 --safety 15%
M2 Max 64 · 64GB · apple
M2 Ultra 192 · 192GB · apple
M3 Max 64 · 64GB · apple
M3 Max 96 · 96GB · apple
M4 Pro 64 · 64GB · apple
M4 Max 128 · 128GB · apple
M3 Ultra 512 · 512GB · apple · monster
M5 Pro 64 · 64GB · apple · neural
M5 Max 128 · 128GB · apple · flagship
$ budget allocation
device capacity: 96GB
kv cache: 0.05GB (0.1% of total)
overhead: 2.5GB (2.6% of total)
weights: 79GB (82% of total)
used: 82 / 96 GB
legend: weights, kv cache, overhead, safety
↳ sorted by best fit

fits · comfortably runs on this budget · 27 models
quant options: FP16/BF16, FP8/INT8, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, AWQ 4-bit, GPTQ 4-bit
weights, in listed order: 75, 66, 72, 77, 75, 50, 70, 68, 65, 65, 61, 54, 52, 48, 42, 30, 29 (Phi-4, 14.7B), 18, 18, 16, 16, 14, 14, 8.0, 7.6, 6.4, 2.5 GB

over · needs a bigger card, more aggressive quant, or model split · 3 models
weights, in listed order: 228, 377, 377 GB
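How a "best quant" can be picked for a given budget is easy to sketch: walk the quant options from highest to lowest precision and take the first one whose weights fit. The Python below is a minimal sketch under assumptions, not the tool's actual logic. The function name `best_quant` and the table `BITS_PER_WEIGHT` are hypothetical, and the table only holds the bits-per-weight values that appear in the worked examples on this page (Q6_K, Q3_K_M, and GPTQ 4-bit are left out because the page does not state their values).

```python
from typing import Optional, Tuple

# Bits per weight, copied from the worked examples on this page.
# Assumption: quants not shown in those examples are simply omitted.
BITS_PER_WEIGHT = {
    "FP16/BF16": 16.0,
    "Q8_0": 8.5,
    "FP8/INT8": 8.0,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.5,
    "AWQ 4-bit": 4.25,
}


def best_quant(params_b: float, budget_gb: float) -> Optional[Tuple[str, float]]:
    """Return (quant name, weights in GB) for the highest-precision quant
    whose weights fit in budget_gb, or None if nothing fits.

    params_b is the parameter count in billions.
    """
    # Highest bits per weight first, so the first hit is the best fit.
    ordered = sorted(BITS_PER_WEIGHT.items(), key=lambda kv: kv[1], reverse=True)
    for name, bits in ordered:
        weights_gb = params_b * bits / 8
        if weights_gb <= budget_gb:
            return name, round(weights_gb, 2)
    return None


if __name__ == "__main__":
    for label, params_b in [("141B", 141), ("104B", 104), ("34B", 34)]:
        print(label, best_quant(params_b, budget_gb=79.05))
```

The result is sensitive to the budget figure for borderline models. For a 72.7B model, this sketch returns FP8/INT8 with a 76 GB budget and Q8_0 with a 79 GB budget, and both figures appear on this page.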

Models that fit on an M3 Max 96.

$ grep "fits" models.json | head -12
Model · Params · Best quant · Weights / 76 GB budget · Fit
Mixtral 8x22B · 141B · AWQ 4-bit · 75 GB · fits
// weights AWQ 4-bit for Mixtral 8x22B (141B params)
weights = params × bits ÷ 8
        = 141 × 4.25 ÷ 8
        = 74.91 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
74.91 ≤ 79.05  → FITS
headroom  = 4.14 GB of weights budget left
gpt-oss 120B · 117B · Q4_K_M · 66 GB · fits
// weights Q4_K_M for gpt-oss 120B (117B params)
weights = params × bits ÷ 8
        = 117 × 4.5 ÷ 8
        = 65.81 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
65.81 ≤ 79.05  → FITS
headroom  = 13.24 GB of weights budget left
Command R+ · 104B · Q5_K_M · 72 GB · fits
// weights Q5_K_M for Command R+ (104B params)
weights = params × bits ÷ 8
        = 104 × 5.5 ÷ 8
        = 71.50 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
71.50 ≤ 79.05  → FITS
headroom  = 7.55 GB of weights budget left
Qwen 2.5 72B · 72.7B · FP8/INT8 · 73 GB · fits
// weights FP8/INT8 for Qwen 2.5 72B (72.7B params)
weights = params × bits ÷ 8
        = 72.7 × 8 ÷ 8
        = 72.70 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
72.70 ≤ 79.05  → FITS
headroom  = 6.35 GB of weights budget left
Llama 3.3 70B · 70.6B · Q8_0 · 75 GB · fits
// weights Q8_0 for Llama 3.3 70B (70.6B params)
weights = params × bits ÷ 8
        = 70.6 × 8.5 ÷ 8
        = 75.01 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
75.01 ≤ 79.05  → FITS
headroom  = 4.04 GB of weights budget left
Mixtral 8x7B · 46.7B · Q8_0 · 50 GB · fits
// weights Q8_0 for Mixtral 8x7B (46.7B params)
weights = params × bits ÷ 8
        = 46.7 × 8.5 ÷ 8
        = 49.62 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
49.62 ≤ 79.05  → FITS
headroom  = 29.43 GB of weights budget left
Qwen 3.6 35B A3B · 35B · FP16/BF16 · 70 GB · fits
// weights FP16/BF16 for Qwen 3.6 35B A3B (35B params)
weights = params × bits ÷ 8
        = 35 × 16 ÷ 8
        = 70.00 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
70.00 ≤ 79.05  → FITS
headroom  = 9.05 GB of weights budget left
Yi 34B · 34B · FP16/BF16 · 68 GB · fits
// weights FP16/BF16 for Yi 34B (34B params)
weights = params × bits ÷ 8
        = 34 × 16 ÷ 8
        = 68.00 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
68.00 ≤ 79.05  → FITS
headroom  = 11.05 GB of weights budget left
Qwen 2.5 32B · 32.5B · FP16/BF16 · 65 GB · fits
// weights FP16/BF16 for Qwen 2.5 32B (32.5B params)
weights = params × bits ÷ 8
        = 32.5 × 16 ÷ 8
        = 65.00 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
65.00 ≤ 79.05  → FITS
headroom  = 14.05 GB of weights budget left
Qwen 2.5 Coder 32B · 32.5B · FP16/BF16 · 65 GB · fits
// weights FP16/BF16 for Qwen 2.5 Coder 32B (32.5B params)
weights = params × bits ÷ 8
        = 32.5 × 16 ÷ 8
        = 65.00 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
65.00 ≤ 79.05  → FITS
headroom  = 14.05 GB of weights budget left
Qwen3 30B A3B · 30.5B · FP16/BF16 · 61 GB · fits
// weights FP16/BF16 for Qwen3 30B A3B (30.5B params)
weights = params × bits ÷ 8
        = 30.5 × 16 ÷ 8
        = 61.00 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
61.00 ≤ 79.05  → FITS
headroom  = 18.05 GB of weights budget left
Qwen 3.6 27B · 27B · FP16/BF16 · 54 GB · fits
// weights FP16/BF16 for Qwen 3.6 27B (27B params)
weights = params × bits ÷ 8
        = 27 × 16 ÷ 8
        = 54.00 GB

// budget on M3 Max 96 (96GB) at ctx 8K, conc 1, 15% safety
kv_cache  = 0.05 GB    (1× at ctx 8K)
overhead  = 2.50 GB    (runtime, Metal, allocator)
safety    = 14.40 GB    (15% of 96GB)
budget    = vram − safety − kv − overhead
          = 96 − 14.40 − 0.05 − 2.50
          = 79.05 GB

// fit decision
54.00 ≤ 79.05  → FITS
headroom  = 25.05 GB of weights budget left
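
The same three steps repeat in every block above: weights from params and bits, budget from device capacity minus reservations, then a comparison. A minimal Python sketch of that arithmetic follows. The function names (`weights_gb`, `weights_budget_gb`, `fit`) are hypothetical and not taken from the tool, and the default constants are the ones printed in the math blocks (96 GB, 15% safety, 0.05 GB KV cache, 2.50 GB overhead).

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Weights footprint in GB: params (in billions) x bits / 8."""
    return params_b * bits_per_weight / 8


def weights_budget_gb(vram_gb: float, safety_frac: float,
                      kv_gb: float, overhead_gb: float) -> float:
    """Budget left for weights: vram - safety - kv - overhead."""
    safety_gb = vram_gb * safety_frac
    return vram_gb - safety_gb - kv_gb - overhead_gb


def fit(params_b: float, bits_per_weight: float,
        vram_gb: float = 96.0, safety_frac: float = 0.15,
        kv_gb: float = 0.05, overhead_gb: float = 2.50) -> dict:
    """Fit decision for one model at one quant, rounded to 2 decimals."""
    w = weights_gb(params_b, bits_per_weight)
    b = weights_budget_gb(vram_gb, safety_frac, kv_gb, overhead_gb)
    return {
        "weights_gb": round(w, 2),
        "budget_gb": round(b, 2),
        "fits": w <= b,
        "headroom_gb": round(b - w, 2),
    }


if __name__ == "__main__":
    # Mixtral 8x22B at AWQ 4-bit (141B params, 4.25 bits per weight)
    print(fit(141, 4.25))
    # Llama 3.3 70B at Q8_0 (70.6B params, 8.5 bits per weight)
    print(fit(70.6, 8.5))
```

Run against the Mixtral 8x22B row, this should reproduce the 74.91 GB weights, 79.05 GB budget, and 4.14 GB headroom shown in its math block. Changing `vram_gb` or `safety_frac` gives a rough view of how the same model would fare on a different device or with a tighter margin.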
