Can I run Llama 3.1 405B?

Short answer: yes, on an M3 Ultra 512 (512GB) at Q8_0. Long answer below.

The math, in one paragraph.

$ ./vrambudget --explain llama-3-1-405b

Llama 3.1 405B has 405B parameters. At FP16 (2 bytes per parameter) that's 810 GB of raw weights. Quantization shrinks that, but you also need budget for the KV cache, framework overhead, and safety headroom. The rule of thumb: the usable budget on a card is roughly its nameplate VRAM minus 25%. That's how the table below was computed.
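The arithmetic in that paragraph can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function names are hypothetical, and the bits-per-weight values for the quantized formats are assumptions (nominal GGUF block sizes) chosen because they reproduce the sizes this page lists.

```python
# Sketch: raw weight size and rule-of-thumb budget for Llama 3.1 405B.
PARAMS = 405e9  # parameter count stated above

# ASSUMPTION: nominal bits per weight for each format. Only the FP16 figure
# (2 bytes per parameter) comes from the text; the rest are illustrative.
BITS_PER_WEIGHT = {
    "FP16/BF16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.5,
    "Q3_K_M": 3.4375,
}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Raw weight size in decimal GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def usable_budget_gb(nameplate_gb: float, reserve: float = 0.25) -> float:
    """Rule of thumb from the text: nameplate VRAM minus 25%."""
    return nameplate_gb * (1 - reserve)


if __name__ == "__main__":
    for quant, bpw in BITS_PER_WEIGHT.items():
        print(f"{quant:<10} {weight_gb(PARAMS, bpw):6.0f} GB")
    print(f"usable budget of a 24GB card: {usable_budget_gb(24):.0f} GB")
```

Under these assumed bit widths the weight sizes round to 810, 430, 278, 228, and 174 GB. The sketch covers weights only; KV cache and framework overhead come on top and depend on context length and runtime.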

What hardware actually fits.

$ grep "fits" gpus.json

Quant       Size     GPUs that fit
FP16/BF16   810GB    0 (none in the catalog)
Q8_0        430GB    1: M3 Ultra 512 (512GB)
Q5_K_M      278GB    1: M3 Ultra 512 (512GB)
Q4_K_M      228GB    1: M3 Ultra 512 (512GB)
Q3_K_M      174GB    1: M3 Ultra 512 (512GB)

Pick your path.

$ ls strategies/

Tightest budget

Smallest GPU that fits Llama 3.1 405B at any quant: M3 Ultra 512 at Q8_0.

Reference quality (FP16)

Lossless inference needs 810 GB, which no single GPU in the catalog provides. That leaves multi-GPU setups only.
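For a rough sense of scale, the 25% rule of thumb can be turned into a GPU count. This is a sketch under stated assumptions: `gpus_needed` is a hypothetical helper, the 80GB card is an illustrative size rather than an entry from the catalog, and the estimate covers weights only, ignoring KV cache and any per-GPU overhead from splitting the model, so treat the result as a lower bound.

```python
import math


def gpus_needed(required_gb: float, nameplate_gb: float, reserve: float = 0.25) -> int:
    """Lower bound on identical GPUs needed to hold `required_gb` of weights.

    Each card contributes its nameplate VRAM minus the reserve fraction
    (the 25% rule of thumb). KV cache and sharding overhead are NOT included.
    """
    if not 0 <= reserve < 1:
        raise ValueError("reserve must be in [0, 1)")
    usable_per_gpu = nameplate_gb * (1 - reserve)
    return math.ceil(required_gb / usable_per_gpu)


if __name__ == "__main__":
    # Hypothetical 80GB cards: 60 GB usable each, so 810 / 60 = 13.5, rounded up.
    print(gpus_needed(810, 80))  # prints 14
```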

Best quality on a 24GB card

None of the showcase quants fit on a 24GB card. Step up.

Tune the math yourself

Open the calculator pre-tuned for Llama 3.1 405B: ↗ /calc?model=llama-3-1-405b

See the full model page.

$ ./open

$ open
Llama 3.1 405B · 405B · Meta · ctx 128K

$ rank
Best GPU for Llama 3.1 405B, ranked by fit quality: see the ranking →
