LLM Quantization Explained: GGUF vs GPTQ vs AWQ (2026 Guide)


You downloaded a 70B model, it does not fit in your VRAM, and now you are staring at a list of files with names like Q4_K_M, GPTQ-4bit, and AWQ. This guide explains what each format actually does, which one to use for your setup, and exactly how much quality you lose at each quantization level.

What Is Quantization?

Full-precision LLMs store each weight as a 32-bit or 16-bit floating point number. A 7B parameter model at FP16 uses about 14 GB. Quantization reduces the number of bits used per weight, shrinking the model at the cost of some precision.

| Precision | Bits per Weight | 7B Model Size | 70B Model Size | Quality |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | 280 GB | Reference |
| FP16 / BF16 | 16 | 14 GB | 140 GB | Near-perfect |
| Q8 / 8-bit | 8 | ~7 GB | ~70 GB | Excellent |
| Q4 / 4-bit | 4 | ~4 GB | ~40 GB | Good |
| Q3 / 3-bit | 3 | ~3 GB | ~30 GB | Noticeable degradation |
| Q2 / 2-bit | 2 | ~2 GB | ~20 GB | Significant degradation |

Key insight: A larger model at lower precision almost always beats a smaller model at higher precision. A 70B Q4 model outperforms a 7B FP16 model by a wide margin, despite similar file sizes. Go as large as your VRAM allows, then reduce precision. Not the other way around.
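The sizes in the table follow directly from parameter count times bits per weight. A minimal sketch in Python (note that real GGUF files run somewhat larger than the raw figure, because per-block scales, metadata, and mixed-precision layers add overhead):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage: parameters * bits / 8 bytes, in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Similar footprint, very different capability:
print(model_size_gb(70, 4))   # 70B at Q4  -> 35.0 GB of raw weights
print(model_size_gb(7, 16))   # 7B at FP16 -> 14.0 GB
```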


The Three Main Formats

GGUF (llama.cpp format)

GGUF is the format used by llama.cpp, Ollama, and LM Studio. It replaced the older GGML format and is the most flexible option for local inference.

Strengths

  • + Runs on CPU, GPU, or both (CPU offloading)
  • + Single file format, easy to manage
  • + Supports mixed quantization (different layers at different precision)
  • + Best ecosystem support (Ollama, LM Studio, llama.cpp)
  • + Works on NVIDIA, AMD, Apple Silicon, and CPU-only

Limitations

  • - Slower GPU inference than GPTQ/AWQ on NVIDIA hardware
  • - Not used for training, only inference
  • - Less optimal for high-throughput server use cases

GGUF quantization variants explained:

| Variant | What it means | Best for |
|---|---|---|
| Q4_K_M | 4-bit with K-quantization, medium size. Most important layers get higher precision. | Best all-round choice |
| Q4_K_S | 4-bit K-quant, smaller. Slightly less quality than Q4_K_M for a smaller file. | Tight VRAM budgets |
| Q5_K_M | 5-bit K-quant. Noticeably better quality than Q4, smaller than Q8. | When you have headroom over Q4 |
| Q8_0 | 8-bit quantization. Near-FP16 quality, 2x the size of Q4. | When quality matters most |
| Q2_K / Q3_K | Very aggressive quantization. Quality drops noticeably, especially on complex reasoning. | Last resort for tiny VRAM |
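Under the hood, these quants work on small blocks of weights, each stored as low-bit integers plus a per-block floating-point scale; the K-quants additionally mix precisions across layers. A simplified absmax sketch of the per-block part (illustrative plain Python, not the actual llama.cpp kernel):

```python
def quantize_block(weights, bits=4):
    """Absmax quantization of one block: an integer per weight plus one FP scale."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for signed 4-bit
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.12, -0.03, 0.47, -0.51, 0.08, 0.33, -0.22, 0.05]
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
# Round-to-nearest error is bounded by half the quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(block, restored))
```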

GPTQ (GPU-optimized quantization)

GPTQ was one of the first practical 4-bit quantization methods for LLMs. It uses a calibration dataset during quantization to minimize error, producing better quality than naive rounding.

Strengths

  • + Fast GPU inference, especially on NVIDIA
  • + Widely available on HuggingFace (TheBloke era)
  • + Works with transformers and text-generation-webui
  • + Better quality than naive INT4 at same bit width

Limitations

  • - NVIDIA GPU required (no CPU offloading)
  • - Slower quantization process than AWQ
  • - Being superseded by AWQ for most use cases
  • - Less actively developed than AWQ and GGUF
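The calibration idea behind GPTQ can be sketched in a few lines: instead of choosing the quantization scale from the weights alone, choose whatever minimizes the layer's output error on sample inputs. This toy version only grid-searches a single scale; the real algorithm is far more sophisticated, quantizing weights one at a time and compensating errors using second-order (Hessian) information:

```python
import random

def quant_dequant(w, scale, qmax=7):
    """Round-to-nearest 4-bit quantization of a single weight."""
    return max(-qmax, min(qmax, round(w / scale))) * scale

def layer_error(weights, inputs, scale):
    """Squared error of the quantized layer's output over calibration inputs."""
    err = 0.0
    for x in inputs:
        y_ref = sum(w * xi for w, xi in zip(weights, x))
        y_q = sum(quant_dequant(w, scale) * xi for w, xi in zip(weights, x))
        err += (y_ref - y_q) ** 2
    return err

random.seed(0)
weights = [random.gauss(0, 0.5) for _ in range(16)]
calib = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]

# Naive rounding picks the scale from the weights alone...
naive_scale = max(abs(w) for w in weights) / 7
# ...while a calibration-aware method keeps whichever scale gives the
# lowest output error on real data.
best_scale = min((naive_scale * f for f in (0.6, 0.7, 0.8, 0.9, 1.0)),
                 key=lambda s: layer_error(weights, calib, s))
```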

AWQ (Activation-aware Weight Quantization)

AWQ improves on GPTQ by identifying which weights are most important for model quality (based on activation patterns) and preserving those at higher precision while aggressively quantizing the rest.

Strengths

  • + Better quality than GPTQ at the same bit width
  • + Fast GPU inference, faster than GPTQ
  • + Preferred format for vLLM and production serving
  • + Actively developed, growing HuggingFace availability

Limitations

  • - NVIDIA GPU required (no CPU offloading)
  • - Less tool support than GGUF outside of vLLM/HF
  • - Fewer model variants available vs GGUF
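The activation-aware idea can be illustrated in a few lines. The weights, activation magnitudes, and square-root scaling rule below are simplified stand-ins; real AWQ searches the per-group scaling exponent on calibration data:

```python
def quant_dequant(w, scale, qmax=7):
    """Round-to-nearest 4-bit quantization of a single weight."""
    return max(-qmax, min(qmax, round(w / scale))) * scale

weights = [0.2, -0.1, 0.45, 0.3, -0.25, 0.15, -0.05, 0.1]
# Mean |activation| per input channel from a calibration pass:
# channel 0 is far more active than the rest ("salient").
act_mag = [5.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

# Plain absmax quantization treats every channel equally.
scale_plain = max(abs(w) for w in weights) / 7
err_plain = abs(weights[0] - quant_dequant(weights[0], scale_plain))

# AWQ's core idea: scale salient weight channels up before quantizing
# (folding the inverse scale into the activations), so the important
# weights use more of the quantization grid.
s = [m ** 0.5 for m in act_mag]
scaled = [w * si for w, si in zip(weights, s)]
scale_awq = max(abs(w) for w in scaled) / 7
err_awq = abs(weights[0] - quant_dequant(scaled[0], scale_awq) / s[0])
# The salient weight is reconstructed more accurately after scaling.
```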

Quality Loss: How Much Do You Actually Lose?

Benchmarks use perplexity (lower is better) as a proxy for quality degradation. For practical use, the difference between Q8 and Q4_K_M is hard to notice in conversation. The difference between Q4 and Q2 is very noticeable on complex reasoning tasks.

| Format | Perplexity vs FP16 | Noticeable in chat? | Noticeable in coding? |
|---|---|---|---|
| Q8_0 (GGUF) | +0.1-0.2% | No | No |
| Q5_K_M (GGUF) | +0.3-0.5% | No | Rarely |
| Q4_K_M (GGUF) | +0.5-1.0% | Rarely | Sometimes |
| AWQ 4-bit | +0.4-0.8% | Rarely | Sometimes |
| GPTQ 4-bit | +0.6-1.2% | Sometimes | Sometimes |
| Q3_K_M (GGUF) | +2-4% | Yes | Yes |
| Q2_K (GGUF) | +8-15% | Clearly | Clearly |

Pro tip: The quality impact of quantization is smaller on larger models. A 70B Q4 model degrades less from quantization than a 7B Q4 model, because the larger model has more redundancy in its weights. This is another reason to prioritize model size over precision.
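Perplexity itself is just the exponential of the average negative log-likelihood the model assigns to the true next tokens, so a "+1%" entry in the table means roughly 5.00 becoming 5.05. A minimal sketch:

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-likelihood) of the probabilities the model
    assigned to each true next token -- lower is better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that is always 50% confident in the right token:
print(round(perplexity([0.5, 0.5, 0.5, 0.5]), 6))  # 2.0
```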

Which Format Should You Use?

Using Ollama or LM Studio

Use GGUF. Both tools use llama.cpp under the hood and pull GGUF models automatically. The default Ollama models are already Q4_K_M. You do not need to think about this: just pull the model and run it.

Recommended:

  • Q4_K_M for most models
  • Q5_K_M if you have VRAM headroom
  • Q8_0 for small models under 13B

Using vLLM for serving or high throughput

Use AWQ. vLLM has excellent AWQ support and it gives the best quality-to-speed ratio for GPU-only inference. If AWQ is not available for your model, GPTQ is a reasonable fallback.

  • First choice: AWQ 4-bit
  • Fallback: GPTQ 4-bit

Using HuggingFace transformers directly

Use AWQ or GPTQ checkpoints through the transformers integrations (backed by the autoawq and auto-gptq libraries), or quantize on the fly with bitsandbytes by loading with load_in_4bit=True. For fine-tuning with QLoRA, on-the-fly 4-bit quantization via bitsandbytes is the standard approach.

  • Inference: AWQ preferred
  • QLoRA fine-tuning: bitsandbytes 4-bit

No GPU (CPU only)

Use GGUF only. It is the only format that supports CPU inference efficiently. Use Q4_K_M for the best balance of quality and speed on CPU. Avoid Q8 on CPU for large models, as it will be painfully slow.

  • Only option: GGUF
  • Best variant: Q4_K_M

Quick Decision Guide

| Your Setup | Use This | Specific Variant |
|---|---|---|
| Ollama on any hardware | GGUF | Default (Q4_K_M already set) |
| LM Studio | GGUF | Q4_K_M or Q5_K_M |
| NVIDIA GPU, want max speed | AWQ | 4-bit AWQ |
| NVIDIA GPU, AWQ not available | GPTQ | 4-bit GPTQ |
| CPU only | GGUF | Q4_K_M |
| Fine-tuning with QLoRA | bitsandbytes | load_in_4bit=True |
| Production inference server | AWQ via vLLM | 4-bit AWQ |

Frequently Asked Questions

Why does Ollama show different sizes for the same model?

Each Ollama tag maps to a specific GGUF quantization level, so the same model name can come in several sizes. The default tag for most models is Q4_K_M; larger variants such as Q5_K_M or Q8_0 are published under explicit tags. You can specify the exact variant when pulling: ollama pull llama3:70b-instruct-q5_K_M.

Is GGUF always slower than GPTQ/AWQ on NVIDIA?

For pure GPU inference, yes. GPTQ and AWQ are more GPU-optimized. But GGUF’s speed is competitive for most users and its CPU offloading capability means you can run larger models than fit in VRAM. For a 24 GB GPU running a 40 GB model partially in RAM, GGUF is the only practical option.

Where do I find quantized models?

HuggingFace is the main source. Search for the model name plus the format (e.g. “Llama-3.1-70B GGUF”). For GGUF, Bartowski and the original model authors often publish quantized versions. Ollama’s model library handles this automatically for popular models.

Can I quantize a model myself?

Yes. For GGUF: use llama.cpp's conversion script (convert_hf_to_gguf.py in current releases) and the llama-quantize tool. For AWQ: use the autoawq library. For GPTQ: use auto-gptq. You need the original FP16 weights from HuggingFace and enough RAM to load them (the full FP16 model must fit in RAM during conversion).
