11 min read

Quantization Explained: Q4, Q5, Q8, and FP16 for Local LLMs

What quantization is, why it matters for VRAM, and when to use each level. Visual comparison of quality vs memory for every use case with 2026 models.

What is Quantization?

Quantization is the process of reducing the precision of a model's weights to make it smaller and faster, with minimal loss in quality.

Think of it like compressing a 4K movie to 1080p — you lose some detail, but most people won't notice, and the file is much smaller.

For local LLMs, quantization is the difference between running a 109B model on consumer hardware or not running it at all.
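To make that concrete, here is a toy sketch of what quantization does to a handful of weights. This is a simplified symmetric scheme for illustration; real formats (like GGUF's K-quants) work block-wise and are more sophisticated, but the core idea is the same:

```python
# Toy sketch: quantize full-precision weights to 4-bit integers and back.
weights = [0.12, -0.83, 0.45, 0.07, -0.31]

# Symmetric 4-bit quantization: map the weight range onto integers -8..7
scale = max(abs(w) for w in weights) / 7
quantized = [max(-8, min(7, round(w / scale))) for w in weights]

# Dequantize: each weight now costs 4 bits instead of 32,
# at the price of a small rounding error (at most scale / 2)
restored = [q * scale for q in quantized]
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)           # [1, -7, 4, 1, -3]
print(round(max_error, 3)) # 0.049
```

Eight times less storage than 32-bit floats, and every restored weight is within half a quantization step of the original.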

Why Quantization Matters

AI models store billions of numbers (weights), each typically a 16-bit or 32-bit floating-point number, so model size scales directly with parameter count. Example: Llama 4 Scout 109B at FP16 needs roughly 109 billion × 2 bytes ≈ 218GB.

Most consumer GPUs have 8-24GB of VRAM. Without quantization, you can't run large models locally.

Quantization Formats Explained

The most common formats you'll see:

FP16 (16-bit Floating Point)

FP16 is essentially the model's full precision, the baseline that quantized formats are measured against. Reality check: FP16 is rarely used for local inference; it's simply too large. Most people start at Q8 or Q5.

Q8 (8-bit Quantization)

Best for: High-stakes tasks (code generation, technical writing), when you have the VRAM budget.

Q5 (5-bit Quantization)

Best for: Daily driving — coding, chat, creative writing. Great quality-to-size ratio.

Q4 (4-bit Quantization)

Best for: Running larger models (109B at Q4 fits in 55GB). The sweet spot for most people.

Q3 and Q2 (Extreme Compression)

Best for: Experimentation, testing if a model fits, or when no other option exists.

---

Visual Comparison: Quality vs VRAM

| Format | VRAM (8B model) | VRAM (109B model) | Quality Score | Perplexity (lower = better) |
| --- | --- | --- | --- | --- |
| FP16 | 16GB | 218GB | 100% | 4.8 |
| Q8 | 8GB | 109GB | 98-99% | 4.9 |
| Q5 | 6GB | 68GB | 95-97% | 5.2 |
| Q4 | 4.5GB | 55GB | 90-94% | 5.6 |
| Q3 | 3.5GB | 41GB | 82-88% | 6.3 |
| Q2 | 2.5GB | 28GB | 75-82% | 7.8 |

Perplexity measures how "confused" the model is. Lower is better. Q4 and Q5 are very close to Q8.
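For intuition about the metric itself: perplexity is the exponential of the average negative log-probability the model assigns to the correct next tokens. A tiny sketch with made-up probabilities (illustrative values only, not real model output):

```python
import math

# Hypothetical per-token probabilities a model assigned to the
# correct next token (illustrative values, not measured data)
token_probs = [0.4, 0.25, 0.6, 0.1, 0.35]

# Perplexity = exp(mean negative log-likelihood)
nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print(round(perplexity, 2))  # 3.43: on average, as "confused" as a
                             # uniform choice among ~3.4 tokens
```

Quantization noise slightly flattens the model's probabilities, which is why perplexity creeps up as you move from Q8 down to Q2.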

---

Real-World Quality Tests

We tested GLM-4.7 Flash 30B at different quantization levels on common tasks:

Test 1: Coding (Python function generation)

Prompt: "Write a Python function to find the longest palindromic substring in a string."

Result: Q5 and Q8 are indistinguishable. Q4 is still good. Q3/Q2 degrade noticeably.
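For reference, a correct answer to this prompt, the kind Q5 and Q8 reliably produce, looks something like this expand-around-center solution (our sketch, not an actual model transcript):

```python
def longest_palindrome(s: str) -> str:
    """Longest palindromic substring via expand-around-center, O(n^2)."""
    if not s:
        return ""
    best = s[0]
    for i in range(len(s)):
        # Try odd-length centers (i, i) and even-length centers (i, i+1)
        for lo, hi in ((i, i), (i, i + 1)):
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                if hi - lo + 1 > len(best):
                    best = s[lo:hi + 1]
                lo -= 1
                hi += 1
    return best

print(longest_palindrome("babad"))  # "bab"
print(longest_palindrome("cbbd"))   # "bb"
```

At Q3/Q2, typical failures are off-by-one bounds errors or missing the even-length case entirely.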

Test 2: Creative Writing

Prompt: "Write the opening paragraph of a sci-fi novel set on a dying space station."

Result: Q5 is excellent. Q4 is usable. Q3/Q2 feel AI-generated.

Test 3: Complex Reasoning

Prompt: "A farmer has 17 sheep. All but 9 die. How many are left?"

Result: Q8, Q5, and Q4 all handle this well. Q3 is borderline. Q2 fails.

---

Which Quantization Should You Use?

For Coding

Recommended: Q5 or Q8

Code quality matters. Q5 hits the sweet spot — minimal quality loss, reasonable VRAM usage. Q8 if you have the VRAM.

For General Chat

Recommended: Q4 or Q5

Q4 is the sweet spot for chat. You won't notice the difference from Q5 in casual conversation.

For Creative Writing

Recommended: Q5

Writers are sensitive to prose quality. Q5 preserves the model's "voice" better than Q4.

For Research / Experimentation

Recommended: Q8 or FP16

If you're benchmarking or fine-tuning, use Q8 or FP16 for reproducibility.

---

How to Choose the Right Quantization

Use this decision tree:

  • Check your VRAM — use our VRAM Calculator
  • Pick the largest model that fits at Q4
  • If you have extra VRAM, bump to Q5 or Q8
  • If you're short on VRAM, try Q3 (or get a smaller model at Q4)
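The decision tree above can be sketched in code. This is a rough sizing heuristic (bits/8 bytes per parameter plus ~10% overhead; `est_vram_gb` and `pick_quant` are illustrative helpers, and real requirements vary with runtime and context length):

```python
QUANT_BITS = {"Q8": 8, "Q5": 5, "Q4": 4, "Q3": 3}

def est_vram_gb(params_b: float, bits: int) -> float:
    """Rough estimate: bits/8 bytes per parameter, plus ~10% overhead."""
    return params_b * bits / 8 * 1.10

def pick_quant(params_b: float, vram_gb: float):
    """Highest-quality format (Q8 > Q5 > Q4 > Q3) that fits in VRAM."""
    for fmt in ("Q8", "Q5", "Q4", "Q3"):
        if est_vram_gb(params_b, QUANT_BITS[fmt]) <= vram_gb:
            return fmt
    return None  # nothing fits: pick a smaller model

print(pick_quant(8, 12))   # Q8: an 8B model fits comfortably at 8-bit
print(pick_quant(30, 20))  # Q4: a 30B model needs 4-bit to fit in 20GB
```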

---

Common Quantization Variants

You'll see these suffixes on model files:

  • Q4_0: the original ("legacy") 4-bit format
  • Q4_K_S: K-quant, small (smaller file, slightly lower quality)
  • Q4_K_M: K-quant, medium (the usual recommendation)

K-quant is a smarter quantization method that preserves important weights. Always prefer K-quant versions (e.g., Q4_K_M over Q4_0).

---

Tools That Use Quantization

All major local LLM tools support quantized models, including Ollama, LM Studio, and llama.cpp.

When you run `ollama pull llama4-vega`, you're downloading a Q4_K_M quantized model by default.

---

The Math Behind VRAM Usage

Want to calculate VRAM usage yourself? Formula:

VRAM (GB) = (Model parameters × Bytes per parameter) + Overhead

Example: Llama 4 Vega 8B at Q4 is 8 billion × 0.5 bytes = 4GB, plus about 0.5GB of overhead, for roughly 4.5GB total. Use our VRAM Calculator to avoid manual math.
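That formula as a small helper (the 0.5GB overhead is a rough assumption; actual overhead grows with context length):

```python
def vram_gb(params_billion: float, bits_per_weight: float,
            overhead_gb: float = 0.5) -> float:
    """VRAM (GB) = parameters x bytes per parameter + overhead."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight + overhead_gb

print(vram_gb(8, 4))    # 4.5  (8B at Q4, matching the table)
print(vram_gb(109, 4))  # 55.0 (109B at Q4)
```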

---

Key Takeaways

  • Quantization shrinks models by storing weights at lower precision, at a small cost in quality.
  • Q4 is the sweet spot for most people; Q5 when output quality matters most (coding, creative writing); Q8 or FP16 for benchmarking and fine-tuning.
  • Q3 and Q2 degrade noticeably; use them only when nothing else fits.
  • Prefer K-quant files (e.g., Q4_K_M over Q4_0).

---

Next Steps

Quantization is what makes local AI practical in 2026. Now you know how to choose the right format for your hardware and use case.
