Select your GPU. Instantly see which LLMs fit in your VRAM at every quantization level.
| Model | Params | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
Every LLM has billions of parameters (weights). Quantization compresses those weights to use less VRAM, at a small quality cost.
A "70B" model has 70 billion floating-point numbers. At full precision (FP16), each takes 2 bytes. That's 140 GB just for the weights.
Quantization reduces precision. Q4_K_M uses ~0.59 bytes/param instead of 2.0. That 70B model drops from 140 GB to ~42 GB.
The runtime (Ollama, llama.cpp) needs roughly 1.5 GB extra for the KV cache, context window, and CUDA/Metal buffers. Estimated total: weights + 1.5 GB.
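The estimate above can be sketched in a few lines of Python. This is a simplification: the 1.5 GB overhead is treated as fixed, while real KV-cache usage grows with context length.

```python
# Bytes per parameter for each quantization format (from the table below).
BYTES_PER_PARAM = {"FP16": 2.00, "Q8_0": 1.06, "Q5_K_M": 0.69, "Q4_K_M": 0.59}
OVERHEAD_GB = 1.5  # rough allowance for KV cache, context, CUDA/Metal buffers

def vram_gb(params_billions: float, fmt: str) -> float:
    """Estimated VRAM in GB: weights + fixed runtime overhead."""
    # params_billions * 1e9 params * bytes/param / 1e9 bytes-per-GB = billions * bytes/param
    weights_gb = params_billions * BYTES_PER_PARAM[fmt]
    return weights_gb + OVERHEAD_GB

print(round(vram_gb(70, "Q4_K_M"), 1))  # a 70B model at Q4_K_M: ~42.8 GB
```

Note that the weights term simplifies nicely: billions of parameters times bytes per parameter gives gigabytes directly.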
| Format | Bytes/Param | Size vs FP16 | Quality |
|---|---|---|---|
| FP16 | 2.00 | 100% | Reference (no loss) |
| Q8_0 | 1.06 | 53% | Near-lossless |
| Q5_K_M | 0.69 | 35% | Very low loss |
| Q4_K_M | 0.59 | 30% | Low loss; best size/quality balance |
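Putting the table to work: a minimal sketch of the tool's core check, listing which quantization formats of a given model fit in a given amount of VRAM (assuming the same ~1.5 GB overhead heuristic).

```python
# Bytes per parameter for each quantization format (from the table above).
BYTES_PER_PARAM = {"FP16": 2.00, "Q8_0": 1.06, "Q5_K_M": 0.69, "Q4_K_M": 0.59}
OVERHEAD_GB = 1.5  # rough runtime overhead

def formats_that_fit(params_billions: float, available_vram_gb: float) -> list[str]:
    """Return the formats whose estimated total VRAM fits in the given budget."""
    return [
        fmt
        for fmt, bpp in BYTES_PER_PARAM.items()
        if params_billions * bpp + OVERHEAD_GB <= available_vram_gb
    ]

# A 13B model on a 24 GB GPU: FP16 (27.5 GB) won't fit, the quantized formats will.
print(formats_that_fit(13, 24))
```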