How Much RAM for Local LLMs? The Complete 2026 Guide

Running LLMs locally is all about memory. Everyone asks about VRAM first, but system RAM is just as critical. Too little and your model crashes on load. Too much and you overspent. This guide gives you the exact numbers for every model size and use case, from a 7B Llama on a gaming PC to a 70B Qwen on a workstation.

How RAM Is Used for Local LLMs

Key insight: When your GPU has enough VRAM, the model lives entirely on the GPU and RAM usage is minimal. When VRAM runs out, model layers spill into system RAM. This is called CPU offloading, and it is far slower but still functional.

There are three distinct ways RAM gets consumed when running a local LLM:

1. Model Weights

If running fully on CPU (no GPU), the entire model loads into RAM. Also used for layers that overflow from VRAM.

7B Q4: ~4 GB

13B Q4: ~7 GB

70B Q4: ~40 GB

2. KV Cache

The context window (your conversation history) is stored in a KV cache. Longer context = more RAM. This scales with both model size and context length.

4K context: ~0.5-1 GB

32K context: ~4-8 GB

128K context: ~16-32 GB

3. OS and Runtime

Your OS, inference tool (Ollama/llama.cpp), and background processes always consume RAM. Budget at least 4-6 GB for a lean Linux system, 8-10 GB for Windows.

Linux: ~4-6 GB overhead

Windows: ~8-10 GB overhead

macOS: ~6-8 GB overhead
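The three components above can be combined into a rough budget. A minimal sketch in Python, using the approximate figures from this section (the overhead values are midpoints of the ranges above, not measurements):

```python
# Rough system-RAM budget for running a local LLM, combining the three
# components above: model weights, KV cache, and OS/runtime overhead.
OS_OVERHEAD_GB = {"linux": 5, "macos": 7, "windows": 9}  # midpoints of the ranges above

def ram_budget_gb(model_weights_gb: float, kv_cache_gb: float,
                  os_name: str = "linux", fits_in_vram: bool = False) -> float:
    """Estimate total system RAM needed.

    If the model fits entirely in VRAM, weights and KV cache live on the
    GPU, so only OS/runtime overhead counts against system RAM.
    """
    budget = OS_OVERHEAD_GB[os_name]
    if not fits_in_vram:
        budget += model_weights_gb + kv_cache_gb
    return budget

# 7B Q4 (~4 GB) fully on CPU with a 32K context (~4 GB KV cache) on Linux:
print(ram_budget_gb(4, 4, "linux"))  # 13 -> a 16 GB machine is tight but workable
# Same model fully in VRAM: only the OS overhead remains.
print(ram_budget_gb(4, 4, "linux", fits_in_vram=True))  # 5
```

The same function reproduces the guide's other recommendations: a 70B Q4 (~40 GB) with an 8 GB KV cache on Windows lands at ~57 GB, which is why 64 GB is the comfortable tier.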


RAM by Model Size

The table below shows total system RAM needed to run each model comfortably. These assume you have a GPU with enough VRAM for the listed quantization. For pure CPU inference, add the model size to your OS overhead instead of using VRAM.

RAM Requirements with a Dedicated GPU

| Model Size | Example Models | Min RAM | Recommended RAM | Why |
|---|---|---|---|---|
| 1B-3B | Llama 3.2 1B/3B, Phi-3 Mini | 8 GB | 16 GB | OS overhead + long-context headroom |
| 7B-8B | Llama 3.1 8B, Mistral 7B, Qwen 7B | 16 GB | 32 GB | Comfortable with 32K+ context |
| 13B-14B | Llama 2 13B, Qwen 14B, Phi-4 | 16 GB | 32 GB | 16 GB works but leaves little headroom |
| 32B-34B | Qwen 32B, Llama 3 70B Q4 (partial) | 32 GB | 64 GB | Partial CPU offload common here |
| 70B-72B | Llama 3.1 70B, Qwen 72B | 32 GB | 64 GB | Needs 64 GB for long context or CPU offload |
| 100B+ (MoE) | Llama 4 Scout, Mixtral 8x22B | 64 GB | 128 GB | MoE models activate fewer params but load all weights |

RAM Requirements for Pure CPU Inference (No GPU)

Running fully on CPU means the entire quantized model must fit in RAM, plus OS overhead. This is slower than GPU inference but works for any hardware.

| Model | Q4_K_M Size | Q8_0 Size | Min RAM (Q4) | Speed (CPU) |
|---|---|---|---|---|
| 7B / 8B | ~4.4 GB | ~8.0 GB | 16 GB | 10-20 tok/s |
| 13B / 14B | ~7.4 GB | ~13.8 GB | 16 GB | 5-12 tok/s |
| 32B / 34B | ~19 GB | ~34 GB | 32 GB | 2-6 tok/s |
| 70B / 72B | ~40 GB | ~72 GB | 64 GB | 1-3 tok/s |

Pro tip: CPU token speeds assume a modern 12-core+ CPU with DDR5 RAM. DDR5’s higher bandwidth (up to 96 GB/s) meaningfully improves CPU inference speed over DDR4 (up to 51 GB/s). For CPU-heavy workloads, the RAM bandwidth matters as much as the capacity.
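The file sizes in the table follow directly from bits per weight: size in bytes is roughly parameter count times bits per weight divided by 8. A sketch, assuming typical effective bits-per-weight figures for llama.cpp quant formats (approximate; the exact value varies slightly per model because some tensors are kept at higher precision):

```python
# Approximate in-RAM / on-disk size of a quantized model:
#   size_GB ≈ params_in_billions * bits_per_weight / 8
# The bits-per-weight values below are typical effective figures for
# llama.cpp quant types (an assumption; exact values vary per model).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}

def model_size_gb(params_billion: float, quant: str) -> float:
    return params_billion * BITS_PER_WEIGHT[quant] / 8

print(model_size_gb(8, "Q4_K_M"))   # Llama 3.1 8B at Q4_K_M: roughly 4.9 GB
print(model_size_gb(70, "Q4_K_M"))  # 70B at Q4_K_M: roughly 42 GB
```

This is why a 70B model quantized to ~5 bits per weight still needs ~40 GB: quantization shrinks each weight, but the weight count is unchanged.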


CPU Offloading: When VRAM and RAM Work Together

CPU offloading lets you run models larger than your VRAM by splitting layers between GPU and CPU. Tools like llama.cpp (and Ollama, which builds on it) support this natively: you set how many layers to load onto the GPU, or let Ollama decide automatically, and the rest spill to RAM.

Example: 70B on RTX 4090 (24 GB)

A Q4_K_M 70B model is ~40 GB, but an RTX 4090 has only 24 GB of VRAM. With the GPU layer count set appropriately (Ollama chooses it automatically; in llama.cpp you pass -ngl), roughly 45-50 of the model's 80 layers fit on the GPU and the rest live in RAM. You need at least 32 GB of system RAM for this to work, and 64 GB to be comfortable.

~24 GB on GPU | ~16 GB spills to RAM | Speed: 5-10 tok/s

Example: 70B on RTX 3060 (12 GB)

With only 12 GB of VRAM, roughly 24 of a 70B Q4 model's 80 layers fit on the GPU. The remaining ~28 GB must live in RAM. This requires 64 GB of system RAM and runs at 1-3 tok/s. Functional, but slow.

~12 GB on GPU | ~28 GB spills to RAM | Speed: 1-3 tok/s

Rule of thumb

Keep at least 2x the model overflow available as free RAM after OS overhead. If 16 GB of model layers spill to RAM, you want 32+ GB of free RAM to avoid thrashing. Running tight forces the OS to swap to disk, which slows inference dramatically.
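The two examples and the rule of thumb reduce to a small calculation. A sketch (the kv_reserve_gb parameter, VRAM held back for the KV cache and buffers, is an illustrative assumption; real tools size this per model):

```python
# Split model weights between VRAM and system RAM, and apply the
# 2x-overflow rule of thumb for free RAM.
def offload_plan(model_gb: float, vram_gb: float, kv_reserve_gb: float = 0.0):
    """Return (GB on GPU, GB spilled to RAM, free RAM wanted).

    kv_reserve_gb is VRAM held back for the KV cache and buffers
    (an assumption for illustration; real tools size this per model).
    """
    on_gpu = min(model_gb, max(vram_gb - kv_reserve_gb, 0))
    spill = model_gb - on_gpu
    free_ram_wanted = 2 * spill  # rule of thumb: 2x the overflow as free RAM
    return on_gpu, spill, free_ram_wanted

# 70B Q4 (~40 GB) on an RTX 4090 (24 GB VRAM):
print(offload_plan(40, 24))  # (24, 16, 32): 16 GB spills, want 32+ GB free
# Same model on an RTX 3060 (12 GB VRAM):
print(offload_plan(40, 12))  # (12, 28, 56): mostly in RAM, hence 64 GB systems
```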

Context Window and RAM

Long context support is one of the biggest reasons people underestimate RAM requirements. Every token in your conversation history occupies space in the KV cache.

| Context Length | 7B Model | 13B Model | 70B Model |
|---|---|---|---|
| 2K (default) | ~0.5 GB | ~0.8 GB | ~2 GB |
| 8K | ~1.5 GB | ~2.5 GB | ~6 GB |
| 32K | ~4 GB | ~8 GB | ~22 GB |
| 128K | ~16 GB | ~30 GB | ~88 GB |
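These figures follow from the standard KV-cache formula: 2 (one K and one V tensor) x layers x KV heads x head dimension x bytes per element, per token. A sketch using Llama 3.1 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache):

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim
    * bytes_per_element per token, times the context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
print(kv_cache_gb(32_768, 32, 8, 128))  # 4.0 GB at 32K, matching the table
print(kv_cache_gb(2_048, 32, 8, 128))   # 0.25 GB at the 2K default
```

Grouped-query attention is why modern 7B-8B models have such modest KV caches: with 8 KV heads instead of 32, the cache is a quarter of what an older same-size model would need.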

Practical note: Ollama's default context is 2048 tokens (raised to 4096 in recent releases). If you want longer conversations or RAG pipelines, set the context explicitly via the num_ctx option (in a Modelfile or API request) or the OLLAMA_CONTEXT_LENGTH environment variable. A 32K context with a 13B model adds about 8 GB to your memory budget.


RAM Recommendations by Use Case

32 GB: The Sweet Spot for Most Users

Covers models up to 13B at all but the most extreme context lengths, and 70B Q4 with a 24 GB GPU via partial offload. Handles RAG pipelines and multi-turn conversations without issues. This is the minimum we recommend for a dedicated AI workstation.

Best for: 7B-13B models, long context, Ollama + Open WebUI, RAG pipelines

64 GB: For Serious 70B Work

Comfortably runs 70B models with partial GPU offloading, handles 32K+ context windows, and leaves room for multiple models in memory simultaneously. The right choice if you run Llama 3.1 70B or Qwen 72B daily.

Best for: 70B models, multi-model setups, long-context RAG, power users

128 GB: MoE and 100B+ Models

Required for large MoE models like Llama 4 Scout (109B total parameters, 17B active) or Mixtral 8x22B loaded fully in RAM. Also the minimum for CPU-only 70B inference at usable speeds. Typically needs DDR5 for adequate bandwidth.

Best for: Llama 4, Mixtral 8x22B, CPU-only 70B, production inference servers

16 GB: Possible but Limiting

You can run 7B models comfortably on 16 GB with a dedicated GPU. But with Windows consuming 8-10 GB, plus runtime overhead and the context cache, headroom is tight. Expect memory pressure with longer contexts or other applications open.

Works for: 7B models with short context on Linux; not recommended for Windows

DDR4 vs DDR5 for Local LLMs

For GPU inference, the difference is small because the bottleneck is VRAM bandwidth, not RAM bandwidth. For CPU inference or heavy offloading, it matters significantly.

| RAM Type | Bandwidth | GPU Inference | CPU Inference | Verdict |
|---|---|---|---|---|
| DDR4-3200 | ~50 GB/s | Fine | Bottleneck | Good for GPU setups |
| DDR4-3600 | ~57 GB/s | Fine | Acceptable | Sweet spot for DDR4 |
| DDR5-5600 | ~89 GB/s | Fine | Good | 30-40% faster CPU inference vs DDR4 |
| DDR5-6400+ | ~102 GB/s | Fine | Best | Best for CPU-heavy or offload setups |
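Bandwidth matters because CPU token generation is roughly memory-bound: each new token requires streaming (nearly) all model weights from RAM, so tok/s is capped at bandwidth divided by model size. A back-of-the-envelope sketch (the efficiency factor is an illustrative assumption, since real workloads don't hit peak bandwidth):

```python
def cpu_tokens_per_sec(model_gb: float, bandwidth_gbps: float,
                       efficiency: float = 0.6) -> float:
    """Bandwidth-bound estimate: every generated token reads the whole
    model from RAM, so tok/s <= bandwidth / model_size. `efficiency`
    is an assumed fraction of peak bandwidth achieved (illustrative)."""
    return efficiency * bandwidth_gbps / model_gb

# 7B Q4_K_M (~4.4 GB) on dual-channel memory:
print(cpu_tokens_per_sec(4.4, 50))  # DDR4-3200: roughly 7 tok/s
print(cpu_tokens_per_sec(4.4, 89))  # DDR5-5600: roughly 12 tok/s
```

The ratio between the two results mirrors the bandwidth ratio, which is why the table credits DDR5 with 30-40%+ faster CPU inference while GPU inference is unaffected.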

Frequently Asked Questions

Do I need more RAM if I have more VRAM?

Not necessarily more, but the baseline stays the same. If your model fits entirely in VRAM, RAM is only used for the OS and context cache. More VRAM means less offloading pressure on RAM, but you still need 16-32 GB for the system to run smoothly.

Can I use swap/virtual memory instead of buying more RAM?

Technically yes, practically no. SSD swap is 10-100x slower than RAM. A 70B model with layers swapped to an NVMe drive will generate 1-3 tokens per minute, not per second. Get the real RAM.

Does running multiple models at once multiply RAM usage?

Yes, if you keep models loaded in memory simultaneously. Ollama by default keeps a model in memory for 5 minutes after the last request. Running two 7B models at once uses roughly 2x the RAM. You can control this with OLLAMA_KEEP_ALIVE.

What about Apple Silicon (M1/M2/M3/M4)?

Apple Silicon uses unified memory shared between CPU and GPU, so a 64 GB M3 Max behaves roughly like a 64 GB VRAM pool (macOS reserves a portion, so not all of it is GPU-allocatable). This changes everything: a 70B Q4 model needs ~40 GB of unified memory, plus room for the OS. A 64 GB Mac can run 70B models natively, which no x86 GPU under $3000 can match.

Ready to Build Your Local LLM Rig?

Need help picking a GPU too? See our Best GPUs for Deep Learning guide.