$ /learn · zero to first model · 12 min read

Learn.

You've heard about running LLMs locally. You're tired of paying per token. You don't know where to start. These six short chapters take you from zero to your first model running on your own hardware. No prior ML required.

$ ls chapters/
01 Parameters: What the numbers in a model name mean.
02 Bits: Why 4 vs 8 vs 16 actually matters.
03 KV cache: The silent VRAM killer.
04 Picking hardware: GPU, Apple Silicon, or rent the cloud.
05 Picking a runtime: Ollama, LM Studio, vLLM, or MLX.
06 Putting it together: Your first model on your hardware.

01 Parameters

A model name like "Llama 3.3 70B" or "Phi-4 14.7B" contains the most important piece of information about the model: the number after the name is the count of parameters, measured in billions.

Parameters are the learned values inside the neural network: the numbers that got tuned during training. More parameters generally means more capacity (smarter answers, better long-context behavior), but the relationship is non-linear and depends heavily on training data and architecture.

For memory math, parameters are the first input. Everything else follows.

Try this

Open the model catalog. Scan the family-grouped cards. Notice the param counts: 1B (edge), 7-8B (workstation), 30-70B (serious), 405B+ (multi-GPU territory).

02 Bits

Each parameter has to be stored somewhere. The natural format is a floating-point number with 16 bits of precision (FP16). 16 bits = 2 bytes. So a 70B-parameter model at FP16 takes 70 × 2 = 140 GB of memory just for the weights.

But you can store each parameter with fewer bits, accepting a small quality loss in exchange. This is called quantization. The most common compression points:

Llama 3.3 70B at each quant
FP16 (reference) 140 GB
FP8 / INT8 (near-lossless) 70 GB
Q5_K_M (very small loss) 48 GB
Q4_K_M (production floor) 39 GB
Q3_K_M (visible loss) 30 GB

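The formula is just params (in billions) × bits ÷ 8 = gigabytes. The K-quants store slightly more than their nominal bit count per parameter, which is why Q4_K_M lands at 39 GB rather than 35. That's the whole thesis of this site. Read /the-math if you want the full derivation.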
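The formula fits in a few lines of Python. This is a minimal sketch, not the site's calculator code: the function names are made up for illustration, and the second helper simply back-solves the effective bits per parameter from a size listed in the table above.

```python
def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: params (in billions) x bits / 8."""
    return params_billions * bits_per_param / 8


def effective_bits(size_gb: float, params_billions: float) -> float:
    """Invert the formula: how many bits per parameter a given size implies."""
    return size_gb * 8 / params_billions


# The two rows of the table that use a whole number of bits.
for name, bits in [("FP16", 16), ("FP8 / INT8", 8)]:
    print(f"Llama 3.3 70B at {name}: {weights_gb(70, bits):.0f} GB")

# The 39 GB listed for Q4_K_M implies a bit more than 4 bits per parameter.
print(f"Q4_K_M effective bits: {effective_bits(39, 70):.2f}")
```

Running the inverse on the other K-quant rows gives similar fractional values, so treat the quant name as a label and the effective bit count as the number that goes into the formula.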

Try this

Open the calculator's By-size tab. Drag the params slider to 70. See the 9-quant grid update. That grid is this formula, rendered live.

03 KV cache

The model needs to remember the previous tokens in the conversation. That memory is called the KV cache. It grows linearly with the length of the conversation (the context window) and with how many parallel requests you serve (concurrency).

At a normal chat length (a few thousand tokens, one request at a time) the KV cache is small. At 32K context with 4 concurrent users, it can be larger than the model weights themselves. This is the line item that catches people off guard.
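To see how that happens, here is a rough estimate for the 32K, 4-user case. Treat it as a sketch under stated assumptions: it uses the standard per-token formula (one key and one value vector per layer per KV head), assumes the cache is stored at FP16, and plugs in architecture numbers (80 layers, 8 KV heads, head dimension 128) that are assumed from the published Llama 3 70B configuration. Check your model's config file before trusting the result, since runtimes can also quantize or page the cache.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size for ONE request, in bytes.

    The leading 2 accounts for storing both a key and a value per token.
    bytes_per_value=2 assumes an FP16 cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


GIB = 2**30

# Assumed Llama-3-70B-style shape; these values are not from this page.
per_request = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             context_tokens=32 * 1024)
concurrency = 4
total = per_request * concurrency

print(f"per request: {per_request / GIB:.1f} GiB")
print(f"x{concurrency} concurrent: {total / GIB:.1f} GiB")
```

Under these assumptions the four-user total comes out above the 39 GB that the same model's weights take at Q4_K_M.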

On top of that, the runtime itself reserves some memory (framework overhead), and it is sensible to keep a safety headroom as a buffer against fragmentation. The full equation is:

What's actually left for weights
budget =
  total_vram × (1 − safety)
  − kv_cache × concurrency
  − framework_overhead

That's the entire calculator. Three subtractions and two multiplications.
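Reading the budget lines above as a chain of subtractions, the calculation can be sketched like this. The function name and the example inputs (a 24 GB card, 10% safety, 1 GB of KV cache per request, 1.5 GB of framework overhead) are illustrative assumptions, not measured values.

```python
def weight_budget_gb(total_vram_gb: float, safety: float,
                     kv_cache_gb: float, concurrency: int,
                     framework_overhead_gb: float) -> float:
    """Memory left for model weights after headroom, KV cache, and overhead."""
    usable = total_vram_gb * (1 - safety)   # keep a safety fraction free
    kv_total = kv_cache_gb * concurrency    # one KV cache per parallel request
    return usable - kv_total - framework_overhead_gb


# Hypothetical single-user setup on a 24 GB card.
budget = weight_budget_gb(total_vram_gb=24, safety=0.10,
                          kv_cache_gb=1.0, concurrency=1,
                          framework_overhead_gb=1.5)
print(f"room for weights: {budget:.1f} GB")
```

A negative result means the KV cache and overhead alone already exceed the card, before any weights are loaded.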

04 Picking hardware

Three real options for running LLMs at home:

Consumer NVIDIA GPU. RTX 3090/4090/5090, 24-32 GB of VRAM. The mainstream answer. Best price-to-performance for single-user workflows. These cards cap out around the 70B class at Q3-Q4 quants.

Apple Silicon Mac. M2/M3/M4/M5 Max or Ultra, 64-512 GB of unified memory. The unified memory model means the GPU has access to the entire system RAM. A single Mac Studio can run models that would need multi-GPU rigs on NVIDIA. Slower per-token than equivalent NVIDIA (less compute, less memory bandwidth), but the memory ceiling is much higher.

Datacenter / workstation. A6000, RTX 6000 Ada, H100, B200, MI300X. Expensive (US$ 4k-30k+ per card). Only worth it if you're building serving infrastructure for many users, or you need 405B+ quality at FP16.

Try this

Open the GPU catalog. Click a card you're considering buying. See the per-card calculator and the fit table of which models actually run on it.

05 Picking a runtime

A runtime is the program that actually loads the model and serves inference. The mainstream open-source options:

Ollama — the easiest start. Single-binary CLI, model registry. Run ollama run llama3.3 and you're done. Cross-platform.

LM Studio — the easiest GUI. Browse and download models in-app, chat in a window. Mac, Windows, Linux.

vLLM — production-grade. PagedAttention, continuous batching, OpenAI-compatible API. What real serving infra is built on. Heavier setup than Ollama.

oMLX — the Apple-native answer. Built on MLX, Apple's array framework. Paged SSD KV cache means coding agents (Claude Code, Cursor) get sub-5s TTFT after the first turn. Mac-only.

06 Putting it together

You have everything. Pick a target model. Look up its size at the quant you want. Subtract overhead and KV cache. Compare to your hardware's budget. If it fits, install a runtime, pull the model, and chat.
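That checklist can be sketched as a single fit check. Everything here is a hedged illustration: the function name is hypothetical, and the default KV cache, overhead, and safety values are placeholder assumptions you should replace with numbers for your own workload.

```python
def fits(params_billions: float, bits_per_param: float, total_vram_gb: float,
         kv_cache_gb: float = 1.0, concurrency: int = 1,
         framework_overhead_gb: float = 1.5, safety: float = 0.10):
    """Return (fits, weights_gb, budget_gb) for a model on a given memory size."""
    # Chapter 02: weight size from parameter count and bits per parameter.
    weights_gb = params_billions * bits_per_param / 8
    # Chapter 03: what is left after headroom, KV cache, and runtime overhead.
    budget_gb = (total_vram_gb * (1 - safety)
                 - kv_cache_gb * concurrency
                 - framework_overhead_gb)
    return weights_gb <= budget_gb, weights_gb, budget_gb


# An 8B model at roughly 4.5 effective bits on a 24 GB card.
ok, weights, budget = fits(8, 4.5, 24)
print(f"8B:  weights {weights:.1f} GB, budget {budget:.1f} GB, fits: {ok}")

# A 70B model at the same quant on the same card.
ok, weights, budget = fits(70, 4.5, 24)
print(f"70B: weights {weights:.1f} GB, budget {budget:.1f} GB, fits: {ok}")
```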

The shortest path:

  1. Pick a model from /model. If you're unsure, start with Llama 3.1 8B (works on almost anything) or Qwen 2.5 7B.
  2. Check /can-i-run/llama-3-1-8b (or your chosen model) to confirm what fits.
  3. Install Ollama (curl -fsSL https://ollama.com/install.sh | sh) or download LM Studio.
  4. Run ollama run llama3.1:8b. Or in LM Studio: search, download, load, chat.
  5. Open the calculator and tune the sliders to match your real workload (context length, concurrency). Share the URL with friends running similar hardware.
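
Once step 4 works in the terminal, you can also talk to the same model from a script. The sketch below assumes Ollama is running locally on its default address (http://localhost:11434) and uses its /api/generate endpoint with streaming turned off. The helper names are made up for illustration, and the model tag matches the one pulled in step 4.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; change it if you moved the server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a POST request asking the local server for one complete reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Usage, with Ollama running and the model already pulled:
#   print(ask("Say hello in five words."))
```

Only the standard library is used here, so there is nothing extra to install.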
$ done. now what?

You can compute your budget, you can pick hardware, you can pick a runtime, you can run a model. The rest is operational stuff: quantization tradeoffs, tool calling, prompt engineering, multi-GPU setups. None of it is gatekept. The math you just learned is the foundation; everything else is patterns built on top.

