What is Quantization?
Quantization is the process of reducing the precision of a model's weights to make it smaller and faster, with minimal loss in quality.
Think of it like compressing a 4K movie to 1080p — you lose some detail, but most people won't notice, and the file is much smaller.
For local LLMs, quantization is the difference between running a 109B model on consumer hardware or not running it at all.
Why Quantization Matters
AI models store billions of numbers (weights). Each weight is typically a 16-bit or 32-bit floating-point number.
Example: Llama 4 Scout 109B
- FP16 (16-bit): ~218GB of VRAM needed
- Q8 (8-bit): ~109GB of VRAM
- Q4 (4-bit): ~55GB of VRAM
- Q3 (3-bit): ~41GB of VRAM
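To make the idea concrete, here is a minimal sketch of symmetric integer quantization, the basic mechanism behind these formats: store each weight as a small integer plus a shared scale factor. Real GGUF quants are more sophisticated (per-block scales, K-quant layouts), so treat this as an illustration, not the actual algorithm.

```python
def quantize(weights, bits):
    """Symmetric quantization: map floats to signed ints sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]    # what gets stored (small ints)
    dequant = [v * scale for v in q]           # what the model computes with
    return q, dequant

weights = [0.82, -0.44, 0.13, -0.97, 0.05]
q8, d8 = quantize(weights, 8)   # tiny rounding error, 1 byte per weight
q4, d4 = quantize(weights, 4)   # larger rounding error, half the storage
```

The trade-off in the VRAM list above falls straight out of this: fewer bits per weight means less memory but coarser rounding, which is the quality loss the rest of this guide quantifies.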
Quantization Formats Explained
The most common formats you'll see:
FP16 (16-bit Floating Point)
- What it is: Original model precision — no compression
- VRAM usage: 2 bytes per parameter
- Quality: Perfect — this is the "reference" quality
- When to use: Research, benchmarking, or if you have unlimited VRAM
- Example: Llama 4 Vega 8B FP16 = ~16GB VRAM
Q8 (8-bit Quantization)
- What it is: 8 bits per weight (half the size of FP16)
- VRAM usage: 1 byte per parameter
- Quality: 98-99% of FP16 — imperceptible quality loss for most tasks
- When to use: When you want near-original quality and have VRAM to spare
- Example: Llama 4 Vega 8B Q8 = ~8GB VRAM
Q5 (5-bit Quantization)
- What it is: 5 bits per weight + metadata (typically 5.5-6 bits effective)
- VRAM usage: ~0.65 bytes per parameter
- Quality: 95-97% of FP16 — slight degradation in nuanced tasks
- When to use: Balanced choice for most users
- Example: Llama 4 Vega 8B Q5 = ~6GB VRAM
Q4 (4-bit Quantization)
- What it is: 4 bits per weight (most popular format)
- VRAM usage: 0.5 bytes per parameter
- Quality: 90-94% of FP16 — noticeable in complex reasoning, but still very good
- When to use: Maximizing model size on limited VRAM
- Example: Llama 4 Vega 8B Q4 = ~4.5GB VRAM
Q3 and Q2 (Extreme Compression)
- What they are: 3-bit and 2-bit quantization
- VRAM usage: 0.375 bytes (Q3) or 0.25 bytes (Q2) per parameter
- Quality: 75-88% of FP16 — noticeable quality loss, more repetition, worse reasoning
- When to use: Last resort when VRAM is extremely limited
- Example: Llama 4 Scout 109B Q2 = ~28GB VRAM
---
Visual Comparison: Quality vs VRAM
| Format | VRAM (8B model) | VRAM (109B model) | Quality Score | Perplexity (lower = better) |
|---|---|---|---|---|
| FP16 | 16GB | 218GB | 100% | 4.8 |
| Q8 | 8GB | 109GB | 98-99% | 4.9 |
| Q5 | 6GB | 68GB | 95-97% | 5.2 |
| Q4 | 4.5GB | 55GB | 90-94% | 5.6 |
| Q3 | 3.5GB | 41GB | 82-88% | 6.3 |
| Q2 | 2.5GB | 28GB | 75-82% | 7.8 |
---
Real-World Quality Tests
We tested GLM-4.7 Flash 30B at different quantization levels on common tasks:
Test 1: Coding (Python function generation)
Prompt: "Write a Python function to find the longest palindromic substring in a string."
- Q8: Perfect solution, clean code, good variable names
- Q5: Perfect solution, slightly less consistent formatting
- Q4: Correct logic, minor inefficiency in edge case handling
- Q3: Correct but suboptimal algorithm
- Q2: Incomplete solution, off-by-one error
Test 2: Creative Writing
Prompt: "Write the opening paragraph of a sci-fi novel set on a dying space station."
- Q8: Vivid descriptions, varied sentence structure
- Q5: Nearly identical to Q8
- Q4: Good prose, slightly more generic phrasing
- Q3: Repetitive word choices, flatter descriptions
- Q2: Awkward phrasing, lacks flow
Test 3: Complex Reasoning
Prompt: "A farmer has 17 sheep. All but 9 die. How many are left?"
- Q8: Correct answer (9), clear explanation
- Q5: Correct answer, clear explanation
- Q4: Correct answer, slightly less confident tone
- Q3: Correct answer but convoluted explanation
- Q2: Incorrect answer (8), confused by wording
---
Which Quantization Should You Use?
For Coding
Recommended: Q5 or Q8
Code quality matters. Q5 hits the sweet spot — minimal quality loss, reasonable VRAM usage. Q8 if you have the VRAM.
- 8B models: Use Q5 or Q8
- 30B models: Use Q4 or Q5 (Q5 if you have 16GB+ VRAM)
- 72B models: Use Q4 (48GB VRAM minimum)
- 109B models: Use Q4 (55GB VRAM) or Q3 (40GB VRAM)
For General Chat
Recommended: Q4 or Q5
Q4 is the sweet spot for chat. You won't notice the difference from Q5 in casual conversation.
- 8B models: Q4 (use Q5 if you have 8GB+ VRAM)
- 30B models: Q4
- 109B models: Q4 (if you have 55GB VRAM) or Q3 (40GB VRAM)
For Creative Writing
Recommended: Q5
Writers are sensitive to prose quality. Q5 preserves the model's "voice" better than Q4.
- 8B models: Q5 (or Q8 if quality is critical)
- 30B models: Q5 if you have 16GB VRAM, otherwise Q4
- 72B models: Q4 (better to run a larger model at Q4 than a smaller one at Q8)
For Research / Experimentation
Recommended: Q8 or FP16
If you're benchmarking or fine-tuning, use Q8 or FP16 for reproducibility.
---
How to Choose the Right Quantization
Work through your options by elimination. Example: you have 24GB of VRAM.
- Llama 4 Scout 109B Q2 (~28GB) won't fit
- Kimi K2.5 72B Q4 (~36GB) won't fit
- Qwen3 30B Q4 (~17GB) fits with room to spare
- Qwen3 30B Q5 (~20GB) also fits
- Best choice: Qwen3 30B Q5 (you have the VRAM, so use higher quality)
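That elimination process is mechanical enough to script. Here's a rough sketch with the approximate sizes from this article hard-coded as assumptions (the `best_fit` helper and the 1GB headroom default are illustrative, not from any tool):

```python
# Approximate VRAM footprints in GB, taken from the figures in this article.
MODEL_SIZES = {
    ("Llama 4 Scout 109B", "Q2"): 28,
    ("Kimi K2.5 72B", "Q4"): 36,
    ("Qwen3 30B", "Q5"): 20,
    ("Qwen3 30B", "Q4"): 17,
}

def best_fit(vram_gb, headroom_gb=1.0):
    """Return the options that fit, largest footprint first."""
    fits = [(name, quant, gb) for (name, quant), gb in MODEL_SIZES.items()
            if gb + headroom_gb <= vram_gb]
    # Prefer the larger footprint: with VRAM to spare, take the higher-quality quant.
    return sorted(fits, key=lambda t: -t[2])

print(best_fit(24))  # Qwen3 30B Q5 first, then Q4; the bigger models are eliminated
```

With 24GB, the function reproduces the walkthrough above: both large models are eliminated and Qwen3 30B Q5 comes out on top.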
Common Quantization Variants
You'll see these suffixes on model files:
- Q4_K_M: Q4 with K-quant method, medium variant (most common)
- Q4_K_S: Q4 K-quant, small variant (slightly lower quality, smaller size)
- Q5_K_M: Q5 K-quant, medium variant (recommended for most users)
- Q8_0: Q8 quantization (near-original quality)
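Because these tags are embedded in GGUF filenames, you can read the quantization level straight off a downloaded file. A small sketch (the filenames below are hypothetical, and the pattern only covers the common tags listed above):

```python
import re

def quant_of(filename):
    """Extract the quantization tag (e.g. 'Q4_K_M', 'Q8_0') from a GGUF filename."""
    m = re.search(r"(Q\d+_K_[SML]|Q\d+_\d|Q\d+|FP?16)", filename, re.IGNORECASE)
    return m.group(1).upper() if m else None

print(quant_of("llama-4-vega-8b.Q5_K_M.gguf"))  # Q5_K_M
print(quant_of("model-Q8_0.gguf"))              # Q8_0
```

Handy when a model folder holds several variants of the same checkpoint and you want to check which one a script is actually loading.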
---
Tools That Use Quantization
All major local LLM tools support quantized models:
- Ollama: All models are pre-quantized (usually Q4_K_M)
- LM Studio: Browse and download any quantization level from Hugging Face
- Jan: Supports GGUF quantized models
- llama.cpp: The underlying engine — supports all quantization formats
- vLLM: High-performance serving with quantization support
When you run `ollama pull llama4-vega`, you're downloading a Q4_K_M quantized model by default.
---
The Math Behind VRAM Usage
Want to calculate VRAM usage yourself? Formula:
VRAM (GB) = (Model parameters × Bytes per parameter) + Overhead
Example: Llama 4 Vega 8B Q4
- Parameters: 8 billion
- Bytes per parameter: 0.5 (Q4)
- Model size: 8B × 0.5 = 4GB
- Overhead (context, KV cache): ~0.5-1GB
- Total VRAM needed: ~4.5-5GB
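The formula translates directly into a few lines of code. The bytes-per-parameter values below are the approximations used throughout this guide (Q5's effective size varies with the variant, so 0.65 is a rough middle value), and the flat 1GB overhead is a simplification — real KV-cache usage grows with context length:

```python
# Approximate bytes per parameter for each format, per this guide.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q5": 0.65, "Q4": 0.5, "Q3": 0.375, "Q2": 0.25}

def vram_gb(params_billions, quant, overhead_gb=1.0):
    """Estimate VRAM in GB: weights plus a rough allowance for context/KV cache."""
    return params_billions * BYTES_PER_PARAM[quant] + overhead_gb

print(round(vram_gb(8, "Q4"), 1))    # 5.0  -> Llama 4 Vega 8B at Q4
print(round(vram_gb(109, "Q4"), 1))  # 55.5 -> Llama 4 Scout 109B at Q4, with overhead
```

Run it against the table earlier in this article and the numbers line up within a gigabyte or two, which is all the precision you need for a fits/doesn't-fit decision.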
---
Key Takeaways
- Q4 is the sweet spot — best balance of quality and VRAM for most people
- Q5 is worth it if you have the VRAM — noticeable quality improvement for coding and writing
- Q8 is near-perfect — only use if you have plenty of VRAM or need reference quality
- Q2/Q3 are last resorts — only use when you have no other option
- Bigger model at Q4 > Smaller model at Q8 — a 30B Q4 outperforms an 8B Q8
Next Steps
- Calculate your VRAM — use our VRAM Calculator to see what fits
- Get started — follow our Laptop LLM Setup Guide
- Compare cloud vs local — read Local LLM vs ChatGPT
- Upgrade hardware — see Best GPUs for Local LLMs