Meta’s Llama 4 introduced two open-weight models: Scout (109B parameters, 16 experts) and Maverick (400B parameters, 128 experts). Both use Mixture-of-Experts (MoE) architecture, activating only 17B parameters per token, which makes local inference surprisingly feasible. This guide covers the exact hardware you need.
Understanding Llama 4 Architecture
Why MoE matters for your hardware: Unlike dense models where all parameters are used for every token, Llama 4’s MoE architecture only activates a fraction of the model per token. This means inference is much faster than the total parameter count suggests, but you still need enough VRAM to load the full model weights.
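To make that concrete, here is a quick back-of-the-envelope calculation in plain Python, using the parameter counts above (real deployments also need headroom for KV cache and runtime overhead):

```python
# Llama 4 Scout, approximate figures from the spec above.
TOTAL_PARAMS = 109e9    # every expert's weights must be resident in memory
ACTIVE_PARAMS = 17e9    # only these participate in each generated token

BYTES_PER_PARAM_FP16 = 2

# VRAM is driven by *total* parameters...
print(f"FP16 weights: ~{TOTAL_PARAMS * BYTES_PER_PARAM_FP16 / 1e9:.0f} GB")          # ~218 GB

# ...while per-token compute and memory traffic are driven by *active* parameters.
print(f"Weights touched per token: ~{ACTIVE_PARAMS * BYTES_PER_PARAM_FP16 / 1e9:.0f} GB")  # ~34 GB
```

That ~6x gap between total and active parameters is why a 109B-parameter model can feel about as fast as a ~17B dense model once it is loaded.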
Scout
109B total parameters spread across 16 experts, with ~17B active per token. The smaller of the two, and the one that realistically runs on consumer hardware.
Maverick
400B total parameters across 128 experts, also ~17B active per token. Even heavily quantized, it needs a multi-GPU setup.
VRAM Requirements by Quantization
Quantization reduces the precision of model weights so they fit in less VRAM. Lower bit widths mean a smaller memory footprint at the cost of some output quality: the loss is minor at 8-bit and 4-bit but becomes noticeable at 2-bit and below. Q4_K_M offers the best balance of quality and memory savings for most users.
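The tables below follow a simple linear rule: weight size ≈ parameters × bits per weight ÷ 8. A tiny helper in plain Python reproduces the pattern; note that real GGUF files run a bit larger than this ideal figure because some tensors stay at higher precision, and you need extra room for KV cache on top:

```python
def weights_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return total_params * bits_per_weight / 8 / 1e9

SCOUT_PARAMS = 109e9  # Llama 4 Scout

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4 / Q4", 4), ("2-bit", 2)]:
    print(f"{label:10s} ~{weights_gb(SCOUT_PARAMS, bits):4.0f} GB")
# FP16 ~218 GB, INT8 ~109 GB, INT4 ~55 GB, 2-bit ~27 GB -- in line with the tables below.
```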
Llama 4 Scout VRAM
| Precision | Model Size | VRAM Needed | GPU Setup |
|---|---|---|---|
| FP16 (full) | ~216 GB | ~232 GB | 4x H100 80GB |
| INT8 | ~109 GB | ~117 GB | 2x H100 80GB |
| INT4 / Q4_K_M | ~55 GB | ~63 GB | 2x RTX 5090 32GB |
| 2-bit (Q2_K) | ~27 GB | ~35 GB | 1x RTX 5090 32GB |
| 1.78-bit (Unsloth) | ~24 GB | ~24 GB | 1x RTX 4090 / 5090 |
Llama 4 Maverick VRAM
| Precision | Model Size | VRAM Needed | GPU Setup |
|---|---|---|---|
| FP16 (full) | ~800 GB | ~816 GB | 7x H200 141GB |
| INT8 | ~400 GB | ~416 GB | 5x H200 141GB |
| INT4 / Q4_K_M | ~200 GB | ~216 GB | 3x H100 80GB |
| 2-bit (Q2_K) | ~100 GB | ~116 GB | 4x RTX 5090 32GB |
| 1.78-bit (Unsloth) | ~89 GB | ~96 GB | 2x RTX 4090 48GB* |
Key insight: Scout is the practical local model. At Q4 quantization, it fits on 2x RTX 5090s with room for context. At aggressive 1.78-bit quantization (via Unsloth), it squeezes into a single 24GB GPU. Maverick requires enterprise-class hardware for most quantization levels.
GPU Recommendations
Best Single-GPU Option
Recommended: RTX 5090 (32 GB)
Runs Scout at 2-bit / 1.78-bit quantization at roughly 20-25 tokens/sec; the 1.78-bit Unsloth quant fits entirely in VRAM, while Q2_K spills a few layers to system RAM. The 32GB of VRAM and 1.8TB/s of memory bandwidth make it the best consumer GPU for local LLMs right now.
Best Value Option
Recommended: RTX 4090 (24 GB) (used market)
Runs Scout at 1.78-bit via Unsloth GGUF at ~15-20 tokens/sec. Available for $1,200-1,400 on the used market as users upgrade to 50-series.
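Here is what loading that setup looks like in practice: a minimal sketch using llama-cpp-python (the Python bindings for llama.cpp). The GGUF file name is a placeholder for whichever Unsloth quant you actually download:

```python
from llama_cpp import Llama

# Placeholder path -- point this at the Scout GGUF you downloaded (e.g. an Unsloth 1.78-bit quant).
MODEL_PATH = "models/llama-4-scout-1.78bit.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=-1,   # offload all layers to the GPU; lower this if you run out of VRAM
    n_ctx=8192,        # context length; a bigger window costs extra VRAM for the KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

If the model doesn't quite fit, reducing n_gpu_layers keeps the remaining layers in system RAM at the cost of speed.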
Best Multi-GPU Setup
Recommended: 2x RTX 5090 (64 GB total)
Runs Scout at Q4_K_M with excellent quality and ~30-40 tokens/sec. Requires a motherboard with 2x PCIe 5.0 x16 slots and a 1600W+ PSU.
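On the software side, llama.cpp handles the split across both cards for you; in llama-cpp-python the tensor_split argument controls how the weights are divided (again a sketch, file name assumed):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-4-scout-q4_k_m.gguf",  # placeholder file name
    n_gpu_layers=-1,           # keep every layer on GPU
    tensor_split=[0.5, 0.5],   # share of the model placed on GPU 0 and GPU 1
    n_ctx=16384,
)
```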
Performance Benchmarks
Scout Inference Speed by GPU
| GPU | Quantization | Tokens/sec | Usable? |
|---|---|---|---|
| RTX 5090 (32GB) | Q2_K / 1.78-bit | ~20-25 tok/s | Comfortable |
| 2x RTX 5090 (64GB) | Q4_K_M | ~30-40 tok/s | Excellent |
| RTX 4090 (24GB) | 1.78-bit (Unsloth) | ~15-20 tok/s | Workable |
| H100 (80GB) | INT8 | ~80-109 tok/s | Fast |
Note: For comfortable conversational use, aim for 15+ tokens/sec. Below 10 tok/s feels sluggish. These benchmarks use llama.cpp and vLLM. Actual performance varies by context length, system RAM, and CPU.
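To sanity-check your own rig against these numbers, you can time a generation directly. A rough sketch with llama-cpp-python follows; it lumps prompt processing in with generation, so treat the result as a lower bound on generation speed:

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="models/llama-4-scout-q2_k.gguf", n_gpu_layers=-1, n_ctx=4096)

start = time.perf_counter()
out = llm("Explain mixture-of-experts routing in three paragraphs.", max_tokens=256)
elapsed = time.perf_counter() - start

tokens = out["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```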
Recommended Software Stack
For Consumer GPUs
llama.cpp (or frontends built on it) with GGUF quantizations, including Unsloth's 1.78-bit dynamic quants. GGUF lets you offload layers between GPU and system RAM when the model doesn't fully fit.
For Multi-GPU / Enterprise
vLLM with tensor parallelism for H100/H200-class deployments at FP16 or INT8, where batching and throughput matter more than squeezing into minimal VRAM.
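For datacenter GPUs, a minimal vLLM sketch looks like the following. The Hugging Face model id is assumed, so verify the exact name on the hub, and set tensor_parallel_size to the number of GPUs in the node:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model id -- verify on Hugging Face
    tensor_parallel_size=2,   # split the weights across 2 GPUs (e.g. 2x H100 for INT8 Scout)
    max_model_len=8192,       # cap the context to keep the KV cache within VRAM
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Give a one-paragraph overview of Llama 4 Scout."], params)
print(outputs[0].outputs[0].text)
```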
Ready to Build Your Llama 4 Rig?
Need a multi-GPU workstation for Scout Q4? Check our Tailored Builds page.