Provider: Meta

Can I run Llama 3.1 405B?

Short answer: yes, on an M3 Ultra 512 (512GB) at Q8_0. Long answer below.

The math, in one paragraph.

$ ./vrambudget --explain llama-3-1-405b

Llama 3.1 405B has 405B parameters. At FP16 (2 bytes per parameter) that's 810 GB of raw weights. Quantization shrinks that, but you also need budget for the KV cache, framework overhead, and safety headroom. The rule of thumb: real usable budget on a card is roughly its nameplate VRAM minus 25%. That's how the table below was computed.
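The weight arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual calculator: the function name `weight_size_gb` is hypothetical, and the bits-per-weight values for the quantized formats are assumed nominal figures that happen to reproduce the sizes quoted on this page. Real quantized files mix tensor types and can come out somewhat larger. KV cache, framework overhead, and headroom are deliberately left out.

```python
# Nominal bits per weight. FP16 is exactly 16 bits; the quantized values are
# ASSUMED nominal figures picked to match the sizes quoted on this page.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.5,
    "Q3_K_M": 3.4375,
}


def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Raw weight footprint in decimal GB (1 GB = 1e9 bytes).

    Covers weights only: no KV cache, no framework overhead, no headroom.
    """
    return params * bits_per_weight / 8 / 1e9


PARAMS = 405e9  # Llama 3.1 405B parameter count, as stated in the text

for quant, bpw in BITS_PER_WEIGHT.items():
    print(f"{quant:8s} {weight_size_gb(PARAMS, bpw):6.0f} GB")
```

Swapping in a different parameter count or bits-per-weight value gives a quick first estimate before reaching for the full calculator.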

What hardware actually fits.

$ grep "fits" gpus.json
Quant      Size    GPUs that fit  Hardware
FP16/BF16  810GB   0              none in the catalog
Q8_0       430GB   1              M3 Ultra 512 (512GB)
Q5_K_M     278GB   1              M3 Ultra 512 (512GB)
Q4_K_M     228GB   1              M3 Ultra 512 (512GB)
Q3_K_M     174GB   1              M3 Ultra 512 (512GB)

Pick your path.

$ ls strategies/
Tightest budget

Smallest GPU that fits Llama 3.1 405B at any quant: M3 Ultra 512 at Q8_0.

Reference quality (FP16)

Unquantized FP16 inference needs 810 GB for the weights alone. No single GPU in the catalog holds that, so this path means a multi-GPU setup.
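For a rough sense of scale, the page's own rule of thumb (usable budget is about nameplate VRAM minus 25%) can be turned into a lower bound on GPU count. This is a sketch under stated assumptions: the helper names `usable_gb` and `min_gpus` are hypothetical, the 80 GB card is an illustrative size rather than a catalog entry, and the estimate counts weights only. It ignores KV cache, interconnect, and any constraint a runtime places on how a model is split across devices.

```python
import math


def usable_gb(nameplate_gb: float, headroom: float = 0.25) -> float:
    """Usable budget per card: nameplate VRAM minus a headroom fraction."""
    return nameplate_gb * (1 - headroom)


def min_gpus(required_gb: float, nameplate_gb: float, headroom: float = 0.25) -> int:
    """Lower bound on the number of identical cards needed to hold required_gb."""
    return math.ceil(required_gb / usable_gb(nameplate_gb, headroom))


# Illustrative only: 810 GB of FP16 weights on hypothetical 80 GB cards.
print(min_gpus(810, 80))
```

With these inputs each card contributes 60 GB of usable budget, so the lower bound is 14 cards. Treat it as a floor, not a build plan.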

Best quality on a 24GB card

None of the showcase quants fit on a 24GB card. Step up.

Tune the math yourself

Open the calculator pre-tuned for Llama 3.1 405B: ↗ /calc?model=llama-3-1-405b

See the full model page.

$ ./open
