Learn.
You've heard about running LLMs locally. You're tired of paying per token. You don't know where to start. This is six short chapters that take you from zero to your first model running on your own hardware. No prior ML required.
01 Parameters
A model name like "Llama 3.3 70B" or "Phi-4 14.7B" contains the most important piece of information about the model: the number after the name is the count of parameters, measured in billions.
Parameters are the learned values inside the neural network: the numbers that got tuned during training. More parameters generally means more capacity (smarter answers, better long-context behavior), but the relationship is non-linear and depends heavily on training data and architecture.
For memory math, parameters are the first input. Everything else follows.
Open the model catalog. Scan the family-grouped cards. Notice the param counts: 1B (edge), 7-8B (workstation), 30-70B (serious), 405B+ (multi-GPU territory).
02 Bits
Each parameter has to be stored somewhere. The natural format is a floating-point number with 16 bits of precision (FP16). 16 bits = 2 bytes. So a 70B-parameter model at FP16 takes 70 × 2 = 140 GB of memory just for the weights.
But you can store each parameter with fewer bits, accepting a small quality loss in exchange. This is called quantization. The most common compression points:
FP8 / INT8 (near-lossless) → 70 GB
Q5_K_M (very small loss) → 48 GB
Q4_K_M (production floor) → 39 GB
Q3_K_M (visible loss) → 30 GB
The formula is just params × bits ÷ 8 = gigabytes. That's the whole thesis of this site. Read /the-math if you want the full derivation.
Open the calculator's By-size tab. Drag the params slider to 70. See the 9-quant grid update. That grid is this formula, rendered live.
03 KV cache
The model needs to remember the previous tokens in the conversation. That memory is called the KV cache. It grows linearly with the length of the conversation (the context window) and with how many parallel requests you serve (concurrency).
At a normal chat length (a few thousand tokens, one request at a time) the KV cache is small. At 32K context with 4 concurrent users, it can be larger than the model weights themselves. This is the line item that catches people off guard.
Plus the runtime itself reserves some memory (framework overhead), and a sensible safety headroom buffer against fragmentation. The full equation is:
total_vram × (1 − safety)
− kv_cache × concurrency
− framework_overhead
That's the entire calculator. Three subtractions and one multiplication.
04 Picking hardware
Three real options for running LLMs at home:
Consumer NVIDIA GPU. RTX 3090/4090/5090, 24-32 GB of VRAM. The mainstream answer. Best price-to-performance for single-user workflows. Cap out around the 70B class at Q3-Q4 quants.
Apple Silicon Mac. M2/M3/M4/M5 Max or Ultra, 64-512 GB of unified memory. The unified memory model means the GPU has access to the entire system RAM. Single Mac Studio runs models that would need multi-GPU rigs on NVIDIA. Slower per-token than equivalent NVIDIA (less compute, less memory bandwidth), but the memory ceiling is much higher.
Datacenter / workstation. A6000, RTX 6000 Ada, H100, B200, MI300X. Expensive (US$ 4k-30k+ per card). Only worth it if you're building serving infrastructure for many users, or you need 405B+ quality at FP16.
Open the GPU catalog. Click a card you're considering buying. See the per-card calculator and the fit table of which models actually run on it.
05 Picking a runtime
A runtime is the program that actually loads the model and serves inference. The mainstream open-source options:
Ollama — the easiest start. Single-binary CLI, model registry. Run ollama run llama3.3 and you're done. Cross-platform.
LM Studio — the easiest GUI. Browse and download models in-app, chat in a window. Mac, Windows, Linux.
vLLM — production-grade. PagedAttention, continuous batching, OpenAI-compatible API. What real serving infra is built on. Heavier setup than Ollama.
oMLX — the Apple-native answer. Built on MLX, Apple's array framework. Paged SSD KV cache means coding agents (Claude Code, Cursor) get sub-5s TTFT after the first turn. Mac-only.
06 Putting it together
You have everything. Pick a target model. Look up its size at the quant you want. Subtract overhead and KV cache. Compare to your hardware's budget. If it fits, install a runtime, pull the model, and chat.
The shortest path:
- Pick a model from /model. If you're unsure, start with Llama 3.1 8B (works on almost anything) or Qwen 2.5 7B.
- Check /can-i-run/llama-3-1-8b (or your chosen model) to confirm what fits.
- Install Ollama (
curl -fsSL https://ollama.com/install.sh | sh) or download LM Studio. - Run
ollama run llama3.1:8b. Or in LM Studio: search, download, load, chat. - Open the calculator and tune the sliders to match your real workload (context length, concurrency). Share the URL with friends running similar hardware.
You can compute your budget, you can pick hardware, you can pick a runtime, you can run a model. The rest is operational stuff: quantization tradeoffs, tool calling, prompt engineering, multi-GPU setups. None of it is gatekept. The math you just learned is the foundation; everything else is patterns built on top.
$ next: open the calculator → · browse the glossary →