Running LLMs locally is all about memory. Everyone asks about VRAM first, but system RAM is just as critical. Too little and your model crashes on load. Too much and you overspent. This guide gives you the exact numbers for every model size and use case, from a 7B Llama on a gaming PC to a 70B Qwen on a workstation.
How RAM Is Used for Local LLMs
Key insight: When your GPU has enough VRAM, the model lives entirely on the GPU and RAM usage is minimal. When VRAM runs out, model layers spill into system RAM. This is called CPU offloading, and it is far slower but still functional.
There are three distinct ways RAM gets consumed when running a local LLM:
Model Weights
If running fully on CPU (no GPU), the entire model loads into RAM. Also used for layers that overflow from VRAM.
7B Q4: ~4 GB
13B Q4: ~7 GB
70B Q4: ~40 GB
KV Cache
The context window (your conversation history) is stored in a KV cache. Longer context = more RAM. This scales with both model size and context length.
4K context: ~0.5-1 GB
32K context: ~4-8 GB
128K context: ~16-32 GB
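The KV cache cost is easy to estimate: each token stores one key vector and one value vector per layer. A rough calculator, assuming an fp16 cache and using Llama 3.1 8B's published dimensions (32 layers, 8 KV heads via GQA, head dimension 128):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, dtype_bytes: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token * ctx_len / 1024**3

# Llama 3.1 8B at a 32K context with an fp16 cache
print(kv_cache_gib(32, 8, 128, 32_768))  # → 4.0
```

Models without grouped-query attention (more KV heads) sit at the high end of the ranges above; quantized KV caches (`dtype_bytes=1`) halve it.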
OS and Runtime
Your OS, inference tool (Ollama/llama.cpp), and background processes always consume RAM. Typical overhead by platform:
Linux: ~4-6 GB overhead
Windows: ~8-10 GB overhead
macOS: ~6-8 GB overhead
RAM by Model Size
The table below shows total system RAM needed to run each model comfortably. These assume you have a GPU with enough VRAM for the listed quantization. For pure CPU inference, add the model size to your OS overhead instead of using VRAM.
RAM Requirements with a Dedicated GPU
| Model Size | Example Models | Min RAM | Recommended RAM | Why |
|---|---|---|---|---|
| 1B-3B | Llama 3.2 1B/3B, Phi-3 Mini | 8 GB | 16 GB | OS overhead + long context headroom |
| 7B-8B | Llama 3.1 8B, Mistral 7B, Qwen 7B | 16 GB | 32 GB | Comfortable with 32K+ context |
| 13B-14B | Llama 2 13B, Qwen 14B, Phi-4 | 16 GB | 32 GB | 16GB works but leaves little headroom |
| 32B-34B | Qwen 32B, Llama 3 70B Q4 (partial) | 32 GB | 64 GB | Partial CPU offload common here |
| 70B-72B | Llama 3.1 70B, Qwen 72B | 32 GB | 64 GB | Needs 64 GB for long context or CPU offload |
| 100B+ (MoE) | Llama 4 Scout, Mixtral 8x22B | 64 GB | 128 GB | MoE models activate fewer params but load all weights |
RAM Requirements for Pure CPU Inference (No GPU)
Running fully on CPU means the entire quantized model must fit in RAM, plus OS overhead. This is slower than GPU inference but works for any hardware.
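The budget arithmetic can be sketched as a small helper. The 25% headroom factor and 6 GB default OS overhead are our assumptions, not hard rules:

```python
def min_cpu_ram_gb(model_file_gb: float, kv_gb: float,
                   os_overhead_gb: float = 6.0, headroom: float = 1.25) -> float:
    """Weights + KV cache + OS overhead, padded with headroom for spikes."""
    return round((model_file_gb + kv_gb + os_overhead_gb) * headroom, 1)

# 7B Q4_K_M (~4.4 GB) with a modest 4K context (~0.5 GB KV cache)
print(min_cpu_ram_gb(4.4, 0.5))  # → 13.6, so a 16 GB machine clears it
```

Running the same estimate for a 70B Q4 model (~40 GB weights, ~2 GB KV) lands at 60 GB, which is why the table above calls for 64 GB.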
| Model | Q4_K_M Size | Q8_0 Size | Min RAM (Q4) | Speed (CPU) |
|---|---|---|---|---|
| 7B / 8B | ~4.4 GB | ~8.0 GB | 16 GB | 10-20 tok/s |
| 13B / 14B | ~7.4 GB | ~13.8 GB | 16 GB | 5-12 tok/s |
| 32B / 34B | ~19 GB | ~34 GB | 32 GB | 2-6 tok/s |
| 70B / 72B | ~40 GB | ~72 GB | 64 GB | 1-3 tok/s |
Pro tip: CPU token speeds assume a modern 12-core+ CPU with DDR5 RAM. DDR5’s higher bandwidth (up to 96 GB/s) meaningfully improves CPU inference speed over DDR4 (up to 51 GB/s). For CPU-heavy workloads, the RAM bandwidth matters as much as the capacity.
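The bandwidth ceiling is easy to see: generating one token must stream every active weight from RAM at least once, so bandwidth divided by model size gives a rough upper bound on tokens per second (a simplification that ignores caches and compute limits):

```python
def cpu_toks_upper_bound(bandwidth_gb_s: float, model_gb: float) -> float:
    """Each token reads all weights once, so throughput <= bandwidth / model size."""
    return round(bandwidth_gb_s / model_gb, 1)

# 7B Q4_K_M (~4.4 GB) on dual-channel DDR5 (~89 GB/s) vs DDR4-3200 (~51 GB/s)
print(cpu_toks_upper_bound(89, 4.4))  # → 20.2 tok/s
print(cpu_toks_upper_bound(51, 4.4))  # → 11.6 tok/s
```

Those bounds line up with the measured 10-20 tok/s range in the table above, which is why capacity alone does not fix a slow CPU setup.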
CPU Offloading: When VRAM and RAM Work Together
CPU offloading lets you run models larger than your VRAM by splitting layers between GPU and CPU. In llama.cpp you set the number of GPU layers with `-ngl`; Ollama estimates a split automatically. Layers that don't fit on the GPU spill to RAM.
A Q4_K_M 70B model is ~40 GB spread across 80 transformer layers, roughly 0.5 GB per layer. With a 24 GB RTX 4090, about 45 layers fit on the GPU (leaving room for the KV cache and buffers), and the remaining ~17-18 GB spill into RAM. You need at least 32 GB of system RAM for this to work, and 64 GB to be comfortable.
With only 12 GB of VRAM, only about 20 of those 80 layers fit on the GPU. The remaining ~30 GB must live in RAM. This requires 64 GB of system RAM and runs at 1-3 tok/s. Functional, but slow.
Rule of thumb: keep at least 2x the model overflow free in RAM after OS overhead. If 16 GB of model layers spill to RAM, aim for 32+ GB of free RAM to avoid thrashing. Running tight forces the OS to swap to disk, which slows inference dramatically.
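A back-of-the-envelope split calculator, assuming equally sized layers and ~1.5 GB of VRAM reserved for the KV cache and buffers (both assumptions, not llama.cpp's exact accounting):

```python
def offload_split(model_gb: float, n_layers: int,
                  vram_gb: float, vram_reserve_gb: float = 1.5):
    """Return (layers that fit on GPU, GB of weights spilling to system RAM)."""
    per_layer_gb = model_gb / n_layers
    gpu_layers = min(n_layers, int((vram_gb - vram_reserve_gb) / per_layer_gb))
    ram_spill_gb = model_gb - gpu_layers * per_layer_gb
    return gpu_layers, round(ram_spill_gb, 1)

# 70B Q4_K_M (~40 GB, 80 layers) on a 24 GB RTX 4090
print(offload_split(40, 80, 24))  # → (45, 17.5)
```

The first number is what you would pass to llama.cpp's `-ngl`; double the second, per the rule of thumb above, to get the free RAM you should have.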
Context Window and RAM
Long context support is one of the biggest reasons people underestimate RAM requirements. Every token in your conversation history occupies space in the KV cache.
| Context Length | 7B Model | 13B Model | 70B Model |
|---|---|---|---|
| 2K (default) | ~0.5 GB | ~0.8 GB | ~2 GB |
| 8K | ~1.5 GB | ~2.5 GB | ~6 GB |
| 32K | ~4 GB | ~8 GB | ~22 GB |
| 128K | ~16 GB | ~30 GB | ~88 GB |
Practical note: Ollama has historically defaulted to a 2048-token context. If you want longer conversations or use RAG pipelines, raise `num_ctx` explicitly (in a Modelfile or via per-request API options). A 32K context with a 13B model adds about 8 GB to your memory budget.
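One way to raise the context for every request is to derive a variant with a Modelfile; the model tag here is an example, assuming `llama3.1:8b` is already pulled:

```shell
# Create a 32K-context variant of an existing model
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 32768
EOF
ollama create llama3.1-32k -f Modelfile
ollama run llama3.1-32k
```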
RAM Recommendations by Use Case
32 GB: The Sweet Spot for Most Users
Covers all models up to 13B with 32K context, and 70B Q4 with a 24 GB GPU via partial offload. Handles RAG pipelines and multi-turn conversations without issues. This is the minimum we recommend for a dedicated AI workstation.
Best for: 7B-13B models, long context, Ollama + Open WebUI, RAG pipelines
64 GB: For Serious 70B Work
Comfortably runs 70B models with partial GPU offloading, handles 32K+ context windows, and leaves room for multiple models in memory simultaneously. The right choice if you run Llama 3.1 70B or Qwen 72B daily.
Best for: 70B models, multi-model setups, long-context RAG, power users
128 GB: MoE and 100B+ Models
Required for large MoE models like Llama 4 Scout (109B total, 17B active) or Mixtral 8x22B loaded fully in RAM. Also the minimum for CPU-only 70B inference at usable speeds. Typically needs DDR5 for adequate bandwidth.
Best for: Llama 4, Mixtral 8x22B, CPU-only 70B, production inference servers
16 GB: Possible but Limiting
You can run 7B models comfortably on 16 GB with a dedicated GPU. But with Windows alone eating 8-10 GB, a 7B model plus its context cache leaves little headroom. You will hit memory pressure with longer contexts or with other applications open.
Works for: 7B models with short context on Linux; not recommended on Windows
DDR4 vs DDR5 for Local LLMs
For GPU inference, the difference is small because the bottleneck is VRAM bandwidth, not RAM bandwidth. For CPU inference or heavy offloading, it matters significantly.
| RAM Type | Bandwidth | GPU Inference | CPU Inference | Verdict |
|---|---|---|---|---|
| DDR4-3200 | ~50 GB/s | Fine | Bottleneck | Good for GPU setups |
| DDR4-3600 | ~57 GB/s | Fine | Acceptable | Sweet spot for DDR4 |
| DDR5-5600 | ~89 GB/s | Fine | Good | 30-40% faster CPU inference vs DDR4 |
| DDR5-6400+ | ~102 GB/s | Fine | Best | Best for CPU-heavy or offload setups |
Frequently Asked Questions
Do I need more RAM if I have more VRAM?
Not necessarily more, but the baseline stays the same. If your model fits entirely in VRAM, RAM is only used for the OS and context cache. More VRAM means less offloading pressure on RAM, but you still need 16-32 GB for the system to run smoothly.
Can I use swap/virtual memory instead of buying more RAM?
Technically yes, practically no. SSD swap is 10-100x slower than RAM. A 70B model running with layers on an NVMe swap drive will generate 1-3 tokens per minute, not per second. Get the real RAM.
Does running multiple models at once multiply RAM usage?
Yes, if you keep models loaded in memory simultaneously. Ollama by default keeps a model in memory for 5 minutes after the last request. Running two 7B models at once uses roughly 2x the RAM. You can control this with OLLAMA_KEEP_ALIVE.
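Both knobs are sketched below; `OLLAMA_KEEP_ALIVE` and the per-request `keep_alive` field are documented Ollama settings (the model tag is an example):

```shell
# Server-wide: keep models resident for an hour (default is 5 minutes)
OLLAMA_KEEP_ALIVE=1h ollama serve

# Per-request: unload the model immediately after this response
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1", "prompt": "hi", "keep_alive": 0}'
```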
What about Apple Silicon (M1/M2/M3/M4)?
Apple Silicon uses unified memory shared between CPU and GPU, so a 64 GB M3 Max behaves much like a GPU with 64 GB of VRAM (macOS caps GPU allocations at roughly 75% of unified memory by default, so budget accordingly). This changes everything: a 70B Q4 model needs ~40 GB of unified memory, plus room for the OS. A 64 GB Mac can run 70B models natively, which no consumer x86 GPU under $3000 can match.
Ready to Build Your Local LLM Rig?
Need help picking a GPU too? See our Best GPUs for Deep Learning guide.