Select your GPU. Instantly see which LLMs fit in your VRAM at every quantization level.
| Model | Params | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
Every LLM has billions of parameters (weights). Quantization compresses those weights to use less VRAM, at a small quality cost.
A "70B" model has 70 billion floating-point numbers. At full precision (FP16), each takes 2 bytes. That's 140 GB just for the weights.
Quantization reduces precision. Q4_K_M uses ~0.59 bytes/param instead of 2.0. That 70B model drops from 140 GB to ~42 GB.
The runtime (Ollama, llama.cpp) needs roughly 1.5 GB extra for the KV cache, context window, and CUDA/Metal buffers. Estimated total: weights + 1.5 GB.
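The estimate above can be sketched in a few lines of Python. This is a simplification: the 1.5 GB overhead is treated as fixed, while real KV-cache usage grows with context length.

```python
# Bytes per parameter for each quantization format (from the table below).
BYTES_PER_PARAM = {"FP16": 2.00, "Q8_0": 1.06, "Q5_K_M": 0.69, "Q4_K_M": 0.59}
OVERHEAD_GB = 1.5  # rough allowance for KV cache, context, CUDA/Metal buffers

def vram_gb(params_billions: float, fmt: str) -> float:
    """Estimated VRAM in GB: weights + fixed runtime overhead."""
    # params_billions * 1e9 params * bytes/param / 1e9 bytes-per-GB = billions * bytes/param
    weights_gb = params_billions * BYTES_PER_PARAM[fmt]
    return weights_gb + OVERHEAD_GB

print(round(vram_gb(70, "Q4_K_M"), 1))  # a 70B model at Q4_K_M: ~42.8 GB
```

Note that the weights term simplifies nicely: billions of parameters times bytes per parameter gives gigabytes directly.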
| Format | Bytes/Param | Size vs FP16 | Quality |
|---|---|---|---|
| FP16 | 2.00 | 100% | Reference (no loss) |
| Q8_0 | 1.06 | 53% | Near-lossless |
| Q5_K_M | 0.69 | 35% | Very low loss |
| Q4_K_M | 0.59 | 30% | Low loss; best size/quality balance |
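Putting the table to work: a minimal sketch of the tool's core check, listing which quantization formats of a given model fit in a given amount of VRAM (assuming the same ~1.5 GB overhead heuristic).

```python
# Bytes per parameter for each quantization format (from the table above).
BYTES_PER_PARAM = {"FP16": 2.00, "Q8_0": 1.06, "Q5_K_M": 0.69, "Q4_K_M": 0.59}
OVERHEAD_GB = 1.5  # rough runtime overhead

def formats_that_fit(params_billions: float, available_vram_gb: float) -> list[str]:
    """Return the formats whose estimated total VRAM fits in the given budget."""
    return [
        fmt
        for fmt, bpp in BYTES_PER_PARAM.items()
        if params_billions * bpp + OVERHEAD_GB <= available_vram_gb
    ]

# A 13B model on a 24 GB GPU: FP16 (27.5 GB) won't fit, the quantized formats will.
print(formats_that_fit(13, 24))
```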