How Much RAM for Local LLMs? The Complete 2026 Guide

Running LLMs locally is all about memory. Everyone asks about VRAM first, but system RAM is just as critical. Too little and your model crashes on load. Too much and you overspent. This guide gives you the exact numbers for every model size and use case, from a 7B Llama on a gaming PC to a 70B Qwen on a workstation.

How RAM Is Used for Local LLMs

Key insight: When your GPU has enough VRAM, the model lives entirely on the GPU and RAM usage is minimal. When VRAM runs out, model layers spill into system RAM. This is called CPU offloading, and it is far slower but still functional.

There are three distinct ways RAM gets consumed when running a local LLM:

1. Model Weights

If running fully on CPU (no GPU), the entire model loads into RAM. Also used for layers that overflow from VRAM.

7B Q4: ~4 GB

13B Q4: ~7 GB

70B Q4: ~40 GB

2. KV Cache

The context window (your conversation history) is stored in a KV cache. Longer context = more RAM. This scales with both model size and context length.

4K context: ~0.5-1 GB

32K context: ~4-8 GB

128K context: ~16-32 GB

3. OS and Runtime

Your OS, inference tool (Ollama/llama.cpp), and background processes always consume RAM. Budget at least 4-6 GB for a lean Linux system, 8-10 GB for Windows.

Linux: ~4-6 GB overhead

Windows: ~8-10 GB overhead

macOS: ~6-8 GB overhead
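The three components above can be combined into a rough budget. A minimal sketch in Python, using the approximate figures from this section (the overhead values are midpoints of the ranges above, not measurements):

```python
# Rough system-RAM budget for running a local LLM, combining the three
# components above: model weights, KV cache, and OS/runtime overhead.
OS_OVERHEAD_GB = {"linux": 5, "macos": 7, "windows": 9}  # midpoints of the ranges above

def ram_budget_gb(model_weights_gb: float, kv_cache_gb: float,
                  os_name: str = "linux", fits_in_vram: bool = False) -> float:
    """Estimate total system RAM needed.

    If the model fits entirely in VRAM, weights and KV cache live on the
    GPU, so only OS/runtime overhead counts against system RAM.
    """
    budget = OS_OVERHEAD_GB[os_name]
    if not fits_in_vram:
        budget += model_weights_gb + kv_cache_gb
    return budget

# 7B Q4 (~4 GB) fully on CPU with a 32K context (~4 GB KV cache) on Linux:
print(ram_budget_gb(4, 4, "linux"))  # 13 -> a 16 GB machine is tight but workable
# Same model fully in VRAM: only the OS overhead remains.
print(ram_budget_gb(4, 4, "linux", fits_in_vram=True))  # 5
```

The same function reproduces the guide's other recommendations: a 70B Q4 (~40 GB) with an 8 GB KV cache on Windows lands at ~57 GB, which is why 64 GB is the comfortable tier.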


RAM by Model Size

The table below shows total system RAM needed to run each model comfortably. These assume you have a GPU with enough VRAM for the listed quantization. For pure CPU inference, add the model size to your OS overhead instead of using VRAM.

RAM Requirements with a Dedicated GPU

| Model Size | Example Models | Min RAM | Recommended RAM | Why |
|---|---|---|---|---|
| 1B-3B | Llama 3.2 1B/3B, Phi-3 Mini | 8 GB | 16 GB | OS overhead + long-context headroom |
| 7B-8B | Llama 3.1 8B, Mistral 7B, Qwen 7B | 16 GB | 32 GB | Comfortable with 32K+ context |
| 13B-14B | Llama 2 13B, Qwen 14B, Phi-4 | 16 GB | 32 GB | 16 GB works but leaves little headroom |
| 32B-34B | Qwen 32B, Llama 3 70B Q4 (partial) | 32 GB | 64 GB | Partial CPU offload common here |
| 70B-72B | Llama 3.1 70B, Qwen 72B | 32 GB | 64 GB | Needs 64 GB for long context or CPU offload |
| 100B+ (MoE) | Llama 4 Scout, Mixtral 8x22B | 64 GB | 128 GB | MoE models activate fewer params but load all weights |

RAM Requirements for Pure CPU Inference (No GPU)

Running fully on CPU means the entire quantized model must fit in RAM, plus OS overhead. This is slower than GPU inference but works for any hardware.

| Model | Q4_K_M Size | Q8_0 Size | Min RAM (Q4) | Speed (CPU) |
|---|---|---|---|---|
| 7B / 8B | ~4.4 GB | ~8.0 GB | 16 GB | 10-20 tok/s |
| 13B / 14B | ~7.4 GB | ~13.8 GB | 16 GB | 5-12 tok/s |
| 32B / 34B | ~19 GB | ~34 GB | 32 GB | 2-6 tok/s |
| 70B / 72B | ~40 GB | ~72 GB | 64 GB | 1-3 tok/s |

Pro tip: CPU token speeds assume a modern 12-core+ CPU with DDR5 RAM. DDR5’s higher bandwidth (up to 96 GB/s) meaningfully improves CPU inference speed over DDR4 (up to 51 GB/s). For CPU-heavy workloads, the RAM bandwidth matters as much as the capacity.
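The file sizes in the table follow directly from bits per weight: size in bytes is roughly parameter count times bits per weight divided by 8. A sketch, assuming typical effective bits-per-weight figures for llama.cpp quant formats (approximate; the exact value varies slightly per model because some tensors are kept at higher precision):

```python
# Approximate in-RAM / on-disk size of a quantized model:
#   size_GB ≈ params_in_billions * bits_per_weight / 8
# The bits-per-weight values below are typical effective figures for
# llama.cpp quant types (an assumption; exact values vary per model).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}

def model_size_gb(params_billion: float, quant: str) -> float:
    return params_billion * BITS_PER_WEIGHT[quant] / 8

print(model_size_gb(8, "Q4_K_M"))   # Llama 3.1 8B at Q4_K_M: roughly 4.9 GB
print(model_size_gb(70, "Q4_K_M"))  # 70B at Q4_K_M: roughly 42 GB
```

This is why a 70B model quantized to ~5 bits per weight still needs ~40 GB: quantization shrinks each weight, but the weight count is unchanged.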


CPU Offloading: When VRAM and RAM Work Together

CPU offloading lets you run models larger than your VRAM by splitting layers between GPU and CPU. Tools like llama.cpp (and Ollama, which builds on it) support this natively: you set how many layers to load onto the GPU, or let Ollama decide automatically, and the rest spill to RAM.

Example: 70B on RTX 4090 (24 GB)

A Q4_K_M 70B model is ~40 GB, but an RTX 4090 has only 24 GB of VRAM. With the GPU layer count set appropriately (Ollama chooses it automatically; in llama.cpp you pass -ngl), roughly 45-50 of the model's 80 layers fit on the GPU and the rest live in RAM. You need at least 32 GB of system RAM for this to work, and 64 GB to be comfortable.

~24 GB on GPU | ~16 GB spills to RAM | Speed: 5-10 tok/s

Example: 70B on RTX 3060 (12 GB)

With only 12 GB of VRAM, roughly 24 of a 70B Q4 model's 80 layers fit on the GPU. The remaining ~28 GB must live in RAM. This requires 64 GB of system RAM and runs at 1-3 tok/s. Functional, but slow.

~12 GB on GPU | ~28 GB spills to RAM | Speed: 1-3 tok/s

Rule of thumb

Keep at least 2x the model overflow available as free RAM after OS overhead. If 16 GB of model layers spill to RAM, you want 32+ GB of free RAM to avoid thrashing. Running tight forces the OS to swap to disk, which slows inference dramatically.
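The two examples and the rule of thumb reduce to a small calculation. A sketch (the kv_reserve_gb parameter, VRAM held back for the KV cache and buffers, is an illustrative assumption; real tools size this per model):

```python
# Split model weights between VRAM and system RAM, and apply the
# 2x-overflow rule of thumb for free RAM.
def offload_plan(model_gb: float, vram_gb: float, kv_reserve_gb: float = 0.0):
    """Return (GB on GPU, GB spilled to RAM, free RAM wanted).

    kv_reserve_gb is VRAM held back for the KV cache and buffers
    (an assumption for illustration; real tools size this per model).
    """
    on_gpu = min(model_gb, max(vram_gb - kv_reserve_gb, 0))
    spill = model_gb - on_gpu
    free_ram_wanted = 2 * spill  # rule of thumb: 2x the overflow as free RAM
    return on_gpu, spill, free_ram_wanted

# 70B Q4 (~40 GB) on an RTX 4090 (24 GB VRAM):
print(offload_plan(40, 24))  # (24, 16, 32): 16 GB spills, want 32+ GB free
# Same model on an RTX 3060 (12 GB VRAM):
print(offload_plan(40, 12))  # (12, 28, 56): mostly in RAM, hence 64 GB systems
```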

Context Window and RAM

Long context support is one of the biggest reasons people underestimate RAM requirements. Every token in your conversation history occupies space in the KV cache.

| Context Length | 7B Model | 13B Model | 70B Model |
|---|---|---|---|
| 2K (default) | ~0.5 GB | ~0.8 GB | ~2 GB |
| 8K | ~1.5 GB | ~2.5 GB | ~6 GB |
| 32K | ~4 GB | ~8 GB | ~22 GB |
| 128K | ~16 GB | ~30 GB | ~88 GB |
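These figures follow from the standard KV-cache formula: 2 (one K and one V tensor) x layers x KV heads x head dimension x bytes per element, per token. A sketch using Llama 3.1 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache):

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim
    * bytes_per_element per token, times the context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
print(kv_cache_gb(32_768, 32, 8, 128))  # 4.0 GB at 32K, matching the table
print(kv_cache_gb(2_048, 32, 8, 128))   # 0.25 GB at the 2K default
```

Grouped-query attention is why modern 7B-8B models have such modest KV caches: with 8 KV heads instead of 32, the cache is a quarter of what an older same-size model would need.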

Practical note: Ollama's default context is 2048 tokens (raised to 4096 in recent releases). If you want longer conversations or RAG pipelines, set the context explicitly via the num_ctx option (in a Modelfile or API request) or the OLLAMA_CONTEXT_LENGTH environment variable. A 32K context with a 13B model adds about 8 GB to your memory budget.


RAM Recommendations by Use Case

32 GB: The Sweet Spot for Most Users

Covers models up to 13B at all but the most extreme context lengths, and 70B Q4 with a 24 GB GPU via partial offload. Handles RAG pipelines and multi-turn conversations without issues. This is the minimum we recommend for a dedicated AI workstation.

Best for: 7B-13B models, long context, Ollama + Open WebUI, RAG pipelines

64 GB: For Serious 70B Work

Comfortably runs 70B models with partial GPU offloading, handles 32K+ context windows, and leaves room for multiple models in memory simultaneously. The right choice if you run Llama 3.1 70B or Qwen 72B daily.

Best for: 70B models, multi-model setups, long-context RAG, power users

128 GB: MoE and 100B+ Models

Required for large MoE models like Llama 4 Scout (109B total parameters, 17B active) or Mixtral 8x22B loaded fully in RAM. Also the minimum for CPU-only 70B inference at usable speeds. Typically needs DDR5 for adequate bandwidth.

Best for: Llama 4, Mixtral 8x22B, CPU-only 70B, production inference servers

16 GB: Possible but Limiting

You can run 7B models comfortably on 16 GB with a dedicated GPU. But with Windows consuming 8-10 GB, plus runtime overhead and the context cache, headroom is tight. Expect memory pressure with longer contexts or other applications open.

Works for: 7B models with short context on Linux; not recommended for Windows

DDR4 vs DDR5 for Local LLMs

For GPU inference, the difference is small because the bottleneck is VRAM bandwidth, not RAM bandwidth. For CPU inference or heavy offloading, it matters significantly.

| RAM Type | Bandwidth | GPU Inference | CPU Inference | Verdict |
|---|---|---|---|---|
| DDR4-3200 | ~50 GB/s | Fine | Bottleneck | Good for GPU setups |
| DDR4-3600 | ~57 GB/s | Fine | Acceptable | Sweet spot for DDR4 |
| DDR5-5600 | ~89 GB/s | Fine | Good | 30-40% faster CPU inference vs DDR4 |
| DDR5-6400+ | ~102 GB/s | Fine | Best | Best for CPU-heavy or offload setups |
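Bandwidth matters because CPU token generation is roughly memory-bound: each new token requires streaming (nearly) all model weights from RAM, so tok/s is capped at bandwidth divided by model size. A back-of-the-envelope sketch (the efficiency factor is an illustrative assumption, since real workloads don't hit peak bandwidth):

```python
def cpu_tokens_per_sec(model_gb: float, bandwidth_gbps: float,
                       efficiency: float = 0.6) -> float:
    """Bandwidth-bound estimate: every generated token reads the whole
    model from RAM, so tok/s <= bandwidth / model_size. `efficiency`
    is an assumed fraction of peak bandwidth achieved (illustrative)."""
    return efficiency * bandwidth_gbps / model_gb

# 7B Q4_K_M (~4.4 GB) on dual-channel memory:
print(cpu_tokens_per_sec(4.4, 50))  # DDR4-3200: roughly 7 tok/s
print(cpu_tokens_per_sec(4.4, 89))  # DDR5-5600: roughly 12 tok/s
```

The ratio between the two results mirrors the bandwidth ratio, which is why the table credits DDR5 with 30-40%+ faster CPU inference while GPU inference is unaffected.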

Frequently Asked Questions

Do I need more RAM if I have more VRAM?

Not necessarily more, but the baseline stays the same. If your model fits entirely in VRAM, RAM is only used for the OS and context cache. More VRAM means less offloading pressure on RAM, but you still need 16-32 GB for the system to run smoothly.

Can I use swap/virtual memory instead of buying more RAM?

Technically yes, practically no. SSD swap is 10-100x slower than RAM. A 70B model with layers swapped to an NVMe drive will generate 1-3 tokens per minute, not per second. Get the real RAM.

Does running multiple models at once multiply RAM usage?

Yes, if you keep models loaded in memory simultaneously. Ollama by default keeps a model in memory for 5 minutes after the last request. Running two 7B models at once uses roughly 2x the RAM. You can control this with OLLAMA_KEEP_ALIVE.

What about Apple Silicon (M1/M2/M3/M4)?

Apple Silicon uses unified memory shared between CPU and GPU, so a 64 GB M3 Max behaves roughly like a 64 GB VRAM pool (macOS reserves a portion, so not all of it is GPU-allocatable). This changes everything: a 70B Q4 model needs ~40 GB of unified memory, plus room for the OS. A 64 GB Mac can run 70B models natively, which no x86 GPU under $3000 can match.

Ready to Build Your Local LLM Rig?

Need help picking a GPU too? See our Best GPUs for Deep Learning guide.