What is Quantization?
Quantization is the process of reducing the precision of a model's weights to make it smaller and faster, with minimal loss in quality.
Think of it like compressing a 4K movie to 1080p — you lose some detail, but most people won't notice, and the file is much smaller.
For local LLMs, quantization is the difference between running a 109B model on consumer hardware or not running it at all.
Why Quantization Matters
AI models store billions of numbers (weights). Each weight is typically a 16-bit or 32-bit floating-point number.
Example: Llama 4 Scout 109B
- FP16 (16-bit): ~218GB of VRAM needed
- Q8 (8-bit): ~109GB of VRAM
- Q4 (4-bit): ~55GB of VRAM
- Q3 (3-bit): ~41GB of VRAM
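To make the idea concrete, here is a minimal sketch of symmetric integer quantization, the basic mechanism behind these formats: store each weight as a small integer plus a shared scale factor. Real GGUF quants are more sophisticated (per-block scales, K-quant layouts), so treat this as an illustration, not the actual algorithm.

```python
def quantize(weights, bits):
    """Symmetric quantization: map floats to signed ints sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]    # what gets stored (small ints)
    dequant = [v * scale for v in q]           # what the model computes with
    return q, dequant

weights = [0.82, -0.44, 0.13, -0.97, 0.05]
q8, d8 = quantize(weights, 8)   # tiny rounding error, 1 byte per weight
q4, d4 = quantize(weights, 4)   # larger rounding error, half the storage
```

The trade-off in the VRAM list above falls straight out of this: fewer bits per weight means less memory but coarser rounding, which is the quality loss the rest of this guide quantifies.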
Quantization Formats Explained
The most common formats you'll see:
FP16 (16-bit Floating Point)
- What it is: Original model precision — no compression
- VRAM usage: 2 bytes per parameter
- Quality: Perfect — this is the "reference" quality
- When to use: Research, benchmarking, or if you have unlimited VRAM
- Example: Llama 4 Vega 8B FP16 = ~16GB VRAM
Q8 (8-bit Quantization)
- What it is: 8 bits per weight (half the size of FP16)
- VRAM usage: 1 byte per parameter
- Quality: 98-99% of FP16 — imperceptible quality loss for most tasks
- When to use: When you want near-original quality and have VRAM to spare
- Example: Llama 4 Vega 8B Q8 = ~8GB VRAM
Q5 (5-bit Quantization)
- What it is: 5 bits per weight + metadata (typically 5.5-6 bits effective)
- VRAM usage: ~0.65 bytes per parameter
- Quality: 95-97% of FP16 — slight degradation in nuanced tasks
- When to use: Balanced choice for most users
- Example: Llama 4 Vega 8B Q5 = ~6GB VRAM
Q4 (4-bit Quantization)
- What it is: 4 bits per weight (most popular format)
- VRAM usage: 0.5 bytes per parameter
- Quality: 90-94% of FP16 — noticeable in complex reasoning, but still very good
- When to use: Maximizing model size on limited VRAM
- Example: Llama 4 Vega 8B Q4 = ~4.5GB VRAM
Q3 and Q2 (Extreme Compression)
- What they are: 3-bit and 2-bit quantization
- VRAM usage: 0.375 bytes (Q3) or 0.25 bytes (Q2) per parameter
- Quality: 75-88% of FP16 — noticeable quality loss, more repetition, worse reasoning
- When to use: Last resort when VRAM is extremely limited
- Example: Llama 4 Scout 109B Q2 = ~28GB VRAM
---
Visual Comparison: Quality vs VRAM
| Format | VRAM (8B model) | VRAM (109B model) | Quality Score | Perplexity (lower = better) |
|---|---|---|---|---|
| FP16 | 16GB | 218GB | 100% | 4.8 |
| Q8 | 8GB | 109GB | 98-99% | 4.9 |
| Q5 | 6GB | 68GB | 95-97% | 5.2 |
| Q4 | 4.5GB | 55GB | 90-94% | 5.6 |
| Q3 | 3.5GB | 41GB | 82-88% | 6.3 |
| Q2 | 2.5GB | 28GB | 75-82% | 7.8 |
---
Real-World Quality Tests
We tested GLM-4.7 Flash 30B at different quantization levels on common tasks:
Test 1: Coding (Python function generation)
Prompt: "Write a Python function to find the longest palindromic substring in a string."
- Q8: Perfect solution, clean code, good variable names
- Q5: Perfect solution, slightly less consistent formatting
- Q4: Correct logic, minor inefficiency in edge case handling
- Q3: Correct but suboptimal algorithm
- Q2: Incomplete solution, off-by-one error
Test 2: Creative Writing
Prompt: "Write the opening paragraph of a sci-fi novel set on a dying space station."
- Q8: Vivid descriptions, varied sentence structure
- Q5: Nearly identical to Q8
- Q4: Good prose, slightly more generic phrasing
- Q3: Repetitive word choices, flatter descriptions
- Q2: Awkward phrasing, lacks flow
Test 3: Complex Reasoning
Prompt: "A farmer has 17 sheep. All but 9 die. How many are left?"
- Q8: Correct answer (9), clear explanation
- Q5: Correct answer, clear explanation
- Q4: Correct answer, slightly less confident tone
- Q3: Correct answer but convoluted explanation
- Q2: Incorrect answer (8), confused by wording
---
Which Quantization Should You Use?
For Coding
Recommended: Q5 or Q8
Code quality matters. Q5 hits the sweet spot — minimal quality loss, reasonable VRAM usage. Q8 if you have the VRAM.
- 8B models: Use Q5 or Q8
- 30B models: Use Q4 or Q5 (Q5 if you have 16GB+ VRAM)
- 72B models: Use Q4 (48GB VRAM minimum)
- 109B models: Use Q4 (55GB VRAM) or Q3 (40GB VRAM)
For General Chat
Recommended: Q4 or Q5
Q4 is the sweet spot for chat. You won't notice the difference from Q5 in casual conversation.
- 8B models: Q4 (use Q5 if you have 8GB+ VRAM)
- 30B models: Q4
- 109B models: Q4 (if you have 55GB VRAM) or Q3 (40GB VRAM)
For Creative Writing
Recommended: Q5
Writers are sensitive to prose quality. Q5 preserves the model's "voice" better than Q4.
- 8B models: Q5 (or Q8 if quality is critical)
- 30B models: Q5 if you have 16GB VRAM, otherwise Q4
- 72B models: Q4 (better to run a larger model at Q4 than a smaller one at Q8)
For Research / Experimentation
Recommended: Q8 or FP16
If you're benchmarking or fine-tuning, use Q8 or FP16 for reproducibility.
---
How to Choose the Right Quantization
Work through your options by elimination. Example: you have 24GB of VRAM.
- Llama 4 Scout 109B Q2 (~28GB) won't fit
- Kimi K2.5 72B Q4 (~36GB) won't fit
- Qwen3 30B Q4 (~17GB) fits with room to spare
- Qwen3 30B Q5 (~20GB) also fits
- Best choice: Qwen3 30B Q5 (you have the VRAM, so use higher quality)
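That elimination process is mechanical enough to script. Here's a rough sketch with the approximate sizes from this article hard-coded as assumptions (the `best_fit` helper and the 1GB headroom default are illustrative, not from any tool):

```python
# Approximate VRAM footprints in GB, taken from the figures in this article.
MODEL_SIZES = {
    ("Llama 4 Scout 109B", "Q2"): 28,
    ("Kimi K2.5 72B", "Q4"): 36,
    ("Qwen3 30B", "Q5"): 20,
    ("Qwen3 30B", "Q4"): 17,
}

def best_fit(vram_gb, headroom_gb=1.0):
    """Return the options that fit, largest footprint first."""
    fits = [(name, quant, gb) for (name, quant), gb in MODEL_SIZES.items()
            if gb + headroom_gb <= vram_gb]
    # Prefer the larger footprint: with VRAM to spare, take the higher-quality quant.
    return sorted(fits, key=lambda t: -t[2])

print(best_fit(24))  # Qwen3 30B Q5 first, then Q4; the bigger models are eliminated
```

With 24GB, the function reproduces the walkthrough above: both large models are eliminated and Qwen3 30B Q5 comes out on top.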
Common Quantization Variants
You'll see these suffixes on model files:
- Q4_K_M: Q4 with K-quant method, medium variant (most common)
- Q4_K_S: Q4 K-quant, small variant (slightly lower quality, smaller size)
- Q5_K_M: Q5 K-quant, medium variant (recommended for most users)
- Q8_0: Q8 quantization (near-original quality)
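Because these tags are embedded in GGUF filenames, you can read the quantization level straight off a downloaded file. A small sketch (the filenames below are hypothetical, and the pattern only covers the common tags listed above):

```python
import re

def quant_of(filename):
    """Extract the quantization tag (e.g. 'Q4_K_M', 'Q8_0') from a GGUF filename."""
    m = re.search(r"(Q\d+_K_[SML]|Q\d+_\d|Q\d+|FP?16)", filename, re.IGNORECASE)
    return m.group(1).upper() if m else None

print(quant_of("llama-4-vega-8b.Q5_K_M.gguf"))  # Q5_K_M
print(quant_of("model-Q8_0.gguf"))              # Q8_0
```

Handy when a model folder holds several variants of the same checkpoint and you want to check which one a script is actually loading.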
---
Tools That Use Quantization
All major local LLM tools support quantized models:
- Ollama: All models are pre-quantized (usually Q4_K_M)
- LM Studio: Browse and download any quantization level from Hugging Face
- Jan: Supports GGUF quantized models
- llama.cpp: The underlying engine — supports all quantization formats
- vLLM: High-performance serving with quantization support
When you run `ollama pull llama4-vega`, you're downloading a Q4_K_M quantized model by default.
---
The Math Behind VRAM Usage
Want to calculate VRAM usage yourself? Formula:
VRAM (GB) = (Model parameters × Bytes per parameter) + Overhead
Example: Llama 4 Vega 8B Q4
- Parameters: 8 billion
- Bytes per parameter: 0.5 (Q4)
- Model size: 8B × 0.5 = 4GB
- Overhead (context, KV cache): ~0.5-1GB
- Total VRAM needed: ~4.5-5GB
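The formula translates directly into a few lines of code. The bytes-per-parameter values below are the approximations used throughout this guide (Q5's effective size varies with the variant, so 0.65 is a rough middle value), and the flat 1GB overhead is a simplification — real KV-cache usage grows with context length:

```python
# Approximate bytes per parameter for each format, per this guide.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q5": 0.65, "Q4": 0.5, "Q3": 0.375, "Q2": 0.25}

def vram_gb(params_billions, quant, overhead_gb=1.0):
    """Estimate VRAM in GB: weights plus a rough allowance for context/KV cache."""
    return params_billions * BYTES_PER_PARAM[quant] + overhead_gb

print(round(vram_gb(8, "Q4"), 1))    # 5.0  -> Llama 4 Vega 8B at Q4
print(round(vram_gb(109, "Q4"), 1))  # 55.5 -> Llama 4 Scout 109B at Q4, with overhead
```

Run it against the table earlier in this article and the numbers line up within a gigabyte or two, which is all the precision you need for a fits/doesn't-fit decision.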
---
Key Takeaways
- Q4 is the sweet spot — best balance of quality and VRAM for most people
- Q5 is worth it if you have the VRAM — noticeable quality improvement for coding and writing
- Q8 is near-perfect — only use if you have plenty of VRAM or need reference quality
- Q2/Q3 are last resorts — only use when you have no other option
- Bigger model at Q4 > Smaller model at Q8 — a 30B Q4 outperforms an 8B Q8
Next Steps
- Calculate your VRAM — use our VRAM Calculator to see what fits
- Get started — follow our Laptop LLM Setup Guide
- Compare cloud vs local — read Local LLM vs ChatGPT
- Upgrade hardware — see Best GPUs for Local LLMs