Can I Run This Model?

Select your GPU. Instantly see which LLMs fit in your VRAM at every quantization level.

// How it works

VRAM Math, Explained

Every LLM has billions of parameters (weights). Quantization compresses those weights to use less VRAM, at a small quality cost.

Step 1

Count the Parameters

A "70B" model has 70 billion floating-point numbers. At full precision (FP16), each takes 2 bytes. That's 140 GB just for the weights.
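The FP16 baseline is simple arithmetic. A minimal sketch (the variable names are illustrative, not from any library):

```python
# Step 1 sketch: weights-only size at full precision.
params = 70e9           # a "70B" model has 70 billion parameters
bytes_per_param = 2.0   # FP16 stores each weight in 2 bytes
weights_gb = params * bytes_per_param / 1e9  # decimal gigabytes
print(weights_gb)  # 140.0
```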

Step 2

Apply Quantization

Quantization reduces precision. Q4_K_M uses ~0.59 bytes/param instead of 2.0. That 70B model drops from 140 GB to ~42 GB.
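The same arithmetic with the quantized figure (0.59 bytes/param is the approximate average the text cites for Q4_K_M):

```python
# Step 2 sketch: the 70B model again, now at Q4_K_M precision.
params = 70e9
q4_bytes_per_param = 0.59  # approximate average for Q4_K_M
weights_gb = params * q4_bytes_per_param / 1e9
print(round(weights_gb, 1))  # 41.3, i.e. the ~42 GB cited above
```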

Step 3

Add Overhead

The runtime (Ollama, llama.cpp) also needs extra VRAM for the KV cache, which grows with your context length, plus CUDA/Metal buffers. At typical context sizes that's roughly 1.5 GB. Final estimate: weights + 1.5 GB.
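The three steps above combine into a one-line estimate. A minimal sketch, assuming the flat 1.5 GB overhead described in the text (the function name is illustrative, not a real library API):

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float,
                     overhead_gb: float = 1.5) -> float:
    """Weights plus a fixed runtime overhead, per the three steps above.

    Billions of params times bytes/param conveniently equals decimal GB.
    """
    return params_billions * bytes_per_param + overhead_gb

# 70B at Q4_K_M: ~41.3 GB of weights + 1.5 GB of overhead
print(round(estimate_vram_gb(70, 0.59), 1))  # 42.8
```

Note the real KV cache grows with context length, so treat the fixed 1.5 GB as a rule of thumb for typical settings, not an upper bound.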

Format    Bytes/Param   Size vs FP16   Quality
FP16      2.00          100%           Baseline
Q8_0      1.06          53%            Near-lossless
Q5_K_M    0.69          35%            Very good
Q4_K_M    0.59          30%            Good
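The table's figures can be sanity-checked in a few lines; the dictionary below simply restates the bytes-per-param values from the table:

```python
# Bytes per parameter for each format, taken from the table above.
BYTES_PER_PARAM = {"FP16": 2.00, "Q8_0": 1.06, "Q5_K_M": 0.69, "Q4_K_M": 0.59}

for fmt, bpp in BYTES_PER_PARAM.items():
    weights_gb = 70 * bpp          # weights for a 70B model, in decimal GB
    pct_of_fp16 = 100 * bpp / 2.0  # size relative to the FP16 baseline
    print(f"{fmt:7s} {weights_gb:6.1f} GB  {pct_of_fp16:5.1f}% of FP16")
```

Divergences of a fraction of a percent from the table come from its rounding (e.g. 0.69 / 2.00 is 34.5%, shown as 35%).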

Get notified when we update this tool

New models and GPUs added regularly. Plus weekly local AI guides and benchmarks.