You downloaded a 70B model, it does not fit in your VRAM, and now you are staring at a list of files with names like Q4_K_M, GPTQ-4bit, and AWQ. This guide explains what each format actually does, which one to use for your setup, and exactly how much quality you lose at each quantization level.
What Is Quantization?
Full-precision LLMs store each weight as a 32-bit or 16-bit floating point number. A 7B parameter model at FP16 uses about 14 GB. Quantization reduces the number of bits used per weight, shrinking the model at the cost of some precision.
| Precision | Bits per Weight | 7B Model Size | 70B Model Size | Quality |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | 280 GB | Reference |
| FP16 / BF16 | 16 | 14 GB | 140 GB | Near-perfect |
| Q8 / 8-bit | 8 | ~7 GB | ~70 GB | Excellent |
| Q4 / 4-bit | 4 | ~4 GB | ~40 GB | Good |
| Q3 / 3-bit | 3 | ~3 GB | ~30 GB | Noticeable degradation |
| Q2 / 2-bit | 2 | ~2 GB | ~20 GB | Significant degradation |
Key insight: A larger model at lower precision almost always beats a smaller model at higher precision. A 70B Q4 model outperforms a 7B FP16 model by a wide margin, despite similar file sizes. Go as large as your VRAM allows, then reduce precision. Not the other way around.
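The size column in the table above follows directly from bits per weight. A quick back-of-envelope sketch in Python (the 10% overhead figure is an assumption covering metadata and layers kept at higher precision, not something from a spec):

```python
def model_size_gb(n_params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Approximate on-disk/VRAM size of quantized weights in GB.

    n_params_b: parameter count in billions (e.g. 7 for a 7B model)
    bits_per_weight: 16 for FP16, 8 for Q8, 4 for Q4, ...
    overhead: rough allowance for metadata and non-quantized layers;
              note this excludes KV cache and activations at runtime.
    """
    bytes_total = n_params_b * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# A 70B model at 4-bit lands near the ~40 GB figure in the table above:
print(round(model_size_gb(70, 4), 1))
```

Runtime memory use is higher than this estimate because the KV cache grows with context length.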
The Three Main Formats
GGUF (llama.cpp format)
GGUF is the format used by llama.cpp, Ollama, and LM Studio. It replaced the older GGML format and is the most flexible option for local inference.
Strengths
- Runs on CPU, GPU, or both (CPU offloading)
- Single-file format, easy to manage
- Supports mixed quantization (different layers at different precision)
- Best ecosystem support (Ollama, LM Studio, llama.cpp)
- Works on NVIDIA, AMD, Apple Silicon, and CPU-only
Limitations
- Slower GPU inference than GPTQ/AWQ on NVIDIA hardware
- Not used for training, only inference
- Less optimal for high-throughput server use cases
GGUF quantization variants explained:
| Variant | What it means | Best for |
|---|---|---|
| Q4_K_M | 4-bit with K-quantization, medium size. Most important layers get higher precision. | Best all-round choice |
| Q4_K_S | 4-bit K-quant, smaller. Slightly less quality than Q4_K_M for a smaller file. | Tight VRAM budgets |
| Q5_K_M | 5-bit K-quant. Noticeably better quality than Q4, smaller than Q8. | When you have headroom over Q4 |
| Q8_0 | 8-bit quantization. Near-FP16 quality, 2x the size of Q4. | When quality matters most |
| Q2_K / Q3_K | Very aggressive quantization. Quality drops noticeably, especially on complex reasoning. | Last resort for tiny VRAM |
GPTQ (GPU-optimized quantization)
GPTQ was one of the first practical 4-bit quantization methods for LLMs. It uses a calibration dataset during quantization to minimize error, producing better quality than naive rounding.
Strengths
- Fast GPU inference, especially on NVIDIA
- Widely available on HuggingFace (TheBloke era)
- Works with transformers and text-generation-webui
- Better quality than naive INT4 at the same bit width
Limitations
- NVIDIA GPU required (no CPU offloading)
- Slower quantization process than AWQ
- Being superseded by AWQ for most use cases
- Less actively developed than AWQ and GGUF
AWQ (Activation-aware Weight Quantization)
AWQ improves on GPTQ by identifying which weights are most important for model quality (based on activation patterns) and preserving those at higher precision while aggressively quantizing the rest.
Strengths
- Better quality than GPTQ at the same bit width
- Fast GPU inference, faster than GPTQ
- Preferred format for vLLM and production serving
- Actively developed, growing HuggingFace availability
Limitations
- NVIDIA GPU required (no CPU offloading)
- Less tool support than GGUF outside of vLLM/HF
- Fewer model variants available vs GGUF
Quality Loss: How Much Do You Actually Lose?
Benchmarks use perplexity (lower is better) as a proxy for quality degradation. For practical use, the difference between Q8 and Q4_K_M is hard to notice in conversation. The difference between Q4 and Q2 is very noticeable on complex reasoning tasks.
| Format | Perplexity vs FP16 | Noticeable in chat? | Noticeable in coding? |
|---|---|---|---|
| Q8_0 (GGUF) | +0.1-0.2% | No | No |
| Q5_K_M (GGUF) | +0.3-0.5% | No | Rarely |
| Q4_K_M (GGUF) | +0.5-1.0% | Rarely | Sometimes |
| AWQ 4-bit | +0.4-0.8% | Rarely | Sometimes |
| GPTQ 4-bit | +0.6-1.2% | Sometimes | Sometimes |
| Q3_K_M (GGUF) | +2-4% | Yes | Yes |
| Q2_K (GGUF) | +8-15% | Clearly | Clearly |
Pro tip: The quality impact of quantization is smaller on larger models. A 70B Q4 model degrades less from quantization than a 7B Q4 model, because the larger model has more redundancy in its weights. This is another reason to prioritize model size over precision.
Which Format Should You Use?
Using Ollama or LM Studio
Use GGUF. Both tools use llama.cpp under the hood and pull GGUF models automatically. The default Ollama models are already Q4_K_M. You do not need to think about this: just pull the model and run it.
Using vLLM for serving or high throughput
Use AWQ. vLLM has excellent AWQ support and it gives the best quality-to-speed ratio for GPU-only inference. If AWQ is not available for your model, GPTQ is a reasonable fallback.
Using HuggingFace transformers directly
Use AWQ (via the autoawq library) or GPTQ (via auto-gptq), or load the original FP16 weights with load_in_4bit=True for on-the-fly quantization through bitsandbytes. For fine-tuning with QLoRA, on-the-fly 4-bit quantization via bitsandbytes is the standard approach.
No GPU (CPU only)
Use GGUF only. It is the only format that supports CPU inference efficiently. Use Q4_K_M for the best balance of quality and speed on CPU. Avoid Q8 on CPU for large models, as it will be painfully slow.
Quick Decision Guide
| Your Setup | Use This | Specific Variant |
|---|---|---|
| Ollama on any hardware | GGUF | Default (Q4_K_M already set) |
| LM Studio | GGUF | Q4_K_M or Q5_K_M |
| NVIDIA GPU, want max speed | AWQ | 4-bit AWQ |
| NVIDIA GPU, AWQ not available | GPTQ | 4-bit GPTQ |
| CPU only | GGUF | Q4_K_M |
| Fine-tuning with QLoRA | bitsandbytes | load_in_4bit=True |
| Production inference server | AWQ via vLLM | 4-bit AWQ |
Frequently Asked Questions
Why does Ollama show different sizes for the same model?
Different tags of the same model map to different GGUF quantization levels, so the same model name can appear in several sizes. The default tag is usually Q4_K_M, while other tags pull Q5_K_M, Q8_0, or even FP16 variants. You can specify the exact variant when pulling: ollama pull llama3:70b-instruct-q5_K_M.
Is GGUF always slower than GPTQ/AWQ on NVIDIA?
For pure GPU inference, yes. GPTQ and AWQ are more GPU-optimized. But GGUF’s speed is competitive for most users and its CPU offloading capability means you can run larger models than fit in VRAM. For a 24 GB GPU running a 40 GB model partially in RAM, GGUF is the only practical option.
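The offloading split is easy to estimate. A minimal sketch of how many transformer layers fit on the GPU (this mirrors llama.cpp's layer-offload setting; it assumes layers are roughly equal in size, which is close but not exact, and the 2 GB headroom is an assumption):

```python
def gpu_layers(model_gb: float, n_layers: int, vram_gb: float, headroom_gb: float = 2.0) -> int:
    """Estimate how many layers of a GGUF model fit in VRAM when the rest
    is kept in system RAM. Remaining layers run on the CPU."""
    per_layer_gb = model_gb / n_layers
    budget = max(vram_gb - headroom_gb, 0)
    return min(n_layers, int(budget / per_layer_gb))

# 40 GB model, 80 layers, 24 GB card: a bit over half the layers fit on GPU.
print(gpu_layers(40, 80, 24))  # → 44
```

The more layers land on the CPU side, the slower generation gets, so this split is a trade-off, not free VRAM.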
Where do I find quantized models?
HuggingFace is the main source. Search for the model name plus the format (e.g. “Llama-3.1-70B GGUF”). For GGUF, Bartowski and the original model authors often publish quantized versions. Ollama’s model library handles this automatically for popular models.
Can I quantize a model myself?
Yes. For GGUF: use llama.cpp's conversion and quantization tools (convert_hf_to_gguf.py and llama-quantize in recent releases; older versions called these convert.py and quantize). For AWQ: use the autoawq library. For GPTQ: use auto-gptq. You need the original FP16 weights from HuggingFace and enough RAM to load them (the full FP16 model must fit in RAM during conversion).
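A sketch of the GGUF workflow as shell commands (tool names as in recent llama.cpp releases; the model directory and file names here are placeholders for your own):

```shell
# 1. Convert the HuggingFace FP16 checkpoint directory to a GGUF file.
python convert_hf_to_gguf.py ./my-model-hf --outfile model-f16.gguf

# 2. Quantize the FP16 GGUF down to Q4_K_M.
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

Step 1 needs the full FP16 model in RAM; step 2 streams through the file and is much lighter.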