You downloaded a 70B model, it does not fit in your VRAM, and now you are staring at a list of files with names like Q4_K_M, GPTQ-4bit, and AWQ. This guide explains what each format actually does, which one to use for your setup, and exactly how much quality you lose at each quantization level.
What Is Quantization?
Full-precision LLMs store each weight as a 32-bit or 16-bit floating point number. A 7B parameter model at FP16 uses about 14 GB. Quantization reduces the number of bits used per weight, shrinking the model at the cost of some precision.
| Precision | Bits per Weight | 7B Model Size | 70B Model Size | Quality |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | 280 GB | Reference |
| FP16 / BF16 | 16 | 14 GB | 140 GB | Near-perfect |
| Q8 / 8-bit | 8 | ~7 GB | ~70 GB | Excellent |
| Q4 / 4-bit | 4 | ~4 GB | ~40 GB | Good |
| Q3 / 3-bit | 3 | ~3 GB | ~30 GB | Noticeable degradation |
| Q2 / 2-bit | 2 | ~2 GB | ~20 GB | Significant degradation |
Key insight: A larger model at lower precision almost always beats a smaller model at higher precision. A 70B Q4 model outperforms a 7B FP16 model by a wide margin, despite similar file sizes. Go as large as your VRAM allows, then reduce precision. Not the other way around.
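The size column in the table above follows directly from bits per weight. A quick back-of-envelope sketch in Python (the 10% overhead figure is an assumption covering metadata and layers kept at higher precision, not something from a spec):

```python
def model_size_gb(n_params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Approximate on-disk/VRAM size of quantized weights in GB.

    n_params_b: parameter count in billions (e.g. 7 for a 7B model)
    bits_per_weight: 16 for FP16, 8 for Q8, 4 for Q4, ...
    overhead: rough allowance for metadata and non-quantized layers;
              note this excludes KV cache and activations at runtime.
    """
    bytes_total = n_params_b * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# A 70B model at 4-bit lands near the ~40 GB figure in the table above:
print(round(model_size_gb(70, 4), 1))
```

Runtime memory use is higher than this estimate because the KV cache grows with context length.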
The Three Main Formats
GGUF (llama.cpp format)
GGUF is the format used by llama.cpp, Ollama, and LM Studio. It replaced the older GGML format and is the most flexible option for local inference.
Strengths
- Runs on CPU, GPU, or both (CPU offloading)
- Single-file format, easy to manage
- Supports mixed quantization (different layers at different precision)
- Best ecosystem support (Ollama, LM Studio, llama.cpp)
- Works on NVIDIA, AMD, Apple Silicon, and CPU-only
Limitations
- Slower GPU inference than GPTQ/AWQ on NVIDIA hardware
- Not used for training, only inference
- Less optimal for high-throughput server use cases
GGUF quantization variants explained:
| Variant | What it means | Best for |
|---|---|---|
| Q4_K_M | 4-bit with K-quantization, medium size. Most important layers get higher precision. | Best all-round choice |
| Q4_K_S | 4-bit K-quant, smaller. Slightly less quality than Q4_K_M for a smaller file. | Tight VRAM budgets |
| Q5_K_M | 5-bit K-quant. Noticeably better quality than Q4, smaller than Q8. | When you have headroom over Q4 |
| Q8_0 | 8-bit quantization. Near-FP16 quality, 2x the size of Q4. | When quality matters most |
| Q2_K / Q3_K | Very aggressive quantization. Quality drops noticeably, especially on complex reasoning. | Last resort for tiny VRAM |
GPTQ (GPU-optimized quantization)
GPTQ was one of the first practical 4-bit quantization methods for LLMs. It uses a calibration dataset during quantization to minimize error, producing better quality than naive rounding.
Strengths
- Fast GPU inference, especially on NVIDIA
- Widely available on HuggingFace (TheBloke era)
- Works with transformers and text-generation-webui
- Better quality than naive INT4 at the same bit width
Limitations
- NVIDIA GPU required (no CPU offloading)
- Slower quantization process than AWQ
- Being superseded by AWQ for most use cases
- Less actively developed than AWQ and GGUF
AWQ (Activation-aware Weight Quantization)
AWQ improves on GPTQ by identifying which weights are most important for model quality (based on activation patterns) and preserving those at higher precision while aggressively quantizing the rest.
Strengths
- Better quality than GPTQ at the same bit width
- Fast GPU inference, faster than GPTQ
- Preferred format for vLLM and production serving
- Actively developed, growing HuggingFace availability
Limitations
- NVIDIA GPU required (no CPU offloading)
- Less tool support than GGUF outside of vLLM/HF
- Fewer model variants available vs GGUF
Quality Loss: How Much Do You Actually Lose?
Benchmarks use perplexity (lower is better) as a proxy for quality degradation. For practical use, the difference between Q8 and Q4_K_M is hard to notice in conversation. The difference between Q4 and Q2 is very noticeable on complex reasoning tasks.
| Format | Perplexity vs FP16 | Noticeable in chat? | Noticeable in coding? |
|---|---|---|---|
| Q8_0 (GGUF) | +0.1-0.2% | No | No |
| Q5_K_M (GGUF) | +0.3-0.5% | No | Rarely |
| Q4_K_M (GGUF) | +0.5-1.0% | Rarely | Sometimes |
| AWQ 4-bit | +0.4-0.8% | Rarely | Sometimes |
| GPTQ 4-bit | +0.6-1.2% | Sometimes | Sometimes |
| Q3_K_M (GGUF) | +2-4% | Yes | Yes |
| Q2_K (GGUF) | +8-15% | Clearly | Clearly |
Pro tip: The quality impact of quantization is smaller on larger models. A 70B Q4 model degrades less from quantization than a 7B Q4 model, because the larger model has more redundancy in its weights. This is another reason to prioritize model size over precision.
Which Format Should You Use?
Using Ollama or LM Studio
Use GGUF. Both tools use llama.cpp under the hood and pull GGUF models automatically. The default Ollama models are already Q4_K_M. You do not need to think about this: just pull the model and run it.
Using vLLM for serving or high throughput
Use AWQ. vLLM has excellent AWQ support and it gives the best quality-to-speed ratio for GPU-only inference. If AWQ is not available for your model, GPTQ is a reasonable fallback.
Using HuggingFace transformers directly
Use AWQ (via the autoawq library) or GPTQ (via auto-gptq), or load the original FP16 weights with load_in_4bit=True for on-the-fly quantization through bitsandbytes. For fine-tuning with QLoRA, on-the-fly 4-bit quantization via bitsandbytes is the standard approach.
No GPU (CPU only)
Use GGUF only. It is the only format that supports CPU inference efficiently. Use Q4_K_M for the best balance of quality and speed on CPU. Avoid Q8 on CPU for large models, as it will be painfully slow.
Quick Decision Guide
| Your Setup | Use This | Specific Variant |
|---|---|---|
| Ollama on any hardware | GGUF | Default (Q4_K_M already set) |
| LM Studio | GGUF | Q4_K_M or Q5_K_M |
| NVIDIA GPU, want max speed | AWQ | 4-bit AWQ |
| NVIDIA GPU, AWQ not available | GPTQ | 4-bit GPTQ |
| CPU only | GGUF | Q4_K_M |
| Fine-tuning with QLoRA | bitsandbytes | load_in_4bit=True |
| Production inference server | AWQ via vLLM | 4-bit AWQ |
Frequently Asked Questions
Why does Ollama show different sizes for the same model?
Different tags of the same model map to different GGUF quantization levels, so the same model name can appear in several sizes. The default tag is usually Q4_K_M, while other tags pull Q5_K_M, Q8_0, or even FP16 variants. You can specify the exact variant when pulling: ollama pull llama3:70b-instruct-q5_K_M.
Is GGUF always slower than GPTQ/AWQ on NVIDIA?
For pure GPU inference, yes. GPTQ and AWQ are more GPU-optimized. But GGUF’s speed is competitive for most users and its CPU offloading capability means you can run larger models than fit in VRAM. For a 24 GB GPU running a 40 GB model partially in RAM, GGUF is the only practical option.
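The offloading split is easy to estimate. A minimal sketch of how many transformer layers fit on the GPU (this mirrors llama.cpp's layer-offload setting; it assumes layers are roughly equal in size, which is close but not exact, and the 2 GB headroom is an assumption):

```python
def gpu_layers(model_gb: float, n_layers: int, vram_gb: float, headroom_gb: float = 2.0) -> int:
    """Estimate how many layers of a GGUF model fit in VRAM when the rest
    is kept in system RAM. Remaining layers run on the CPU."""
    per_layer_gb = model_gb / n_layers
    budget = max(vram_gb - headroom_gb, 0)
    return min(n_layers, int(budget / per_layer_gb))

# 40 GB model, 80 layers, 24 GB card: a bit over half the layers fit on GPU.
print(gpu_layers(40, 80, 24))  # → 44
```

The more layers land on the CPU side, the slower generation gets, so this split is a trade-off, not free VRAM.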
Where do I find quantized models?
HuggingFace is the main source. Search for the model name plus the format (e.g. “Llama-3.1-70B GGUF”). For GGUF, Bartowski and the original model authors often publish quantized versions. Ollama’s model library handles this automatically for popular models.
Can I quantize a model myself?
Yes. For GGUF: use llama.cpp's conversion and quantization tools (convert_hf_to_gguf.py and llama-quantize in recent releases; older versions called these convert.py and quantize). For AWQ: use the autoawq library. For GPTQ: use auto-gptq. You need the original FP16 weights from HuggingFace and enough RAM to load them (the full FP16 model must fit in RAM during conversion).
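A sketch of the GGUF workflow as shell commands (tool names as in recent llama.cpp releases; the model directory and file names here are placeholders for your own):

```shell
# 1. Convert the HuggingFace FP16 checkpoint directory to a GGUF file.
python convert_hf_to_gguf.py ./my-model-hf --outfile model-f16.gguf

# 2. Quantize the FP16 GGUF down to Q4_K_M.
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

Step 1 needs the full FP16 model in RAM; step 2 streams through the file and is much lighter.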