Can I Run This Model?

Select your GPU. Instantly see which LLMs fit in your VRAM at every quantization level.

Configure Your Hardware
Enter your VRAM in GB.
💡 Shopping for a GPU? Enter the VRAM of the card you're considering (e.g., 24 GB for an RTX 4090) to see which models it can run.
Model Compatibility
Legend: Runs well · Tight fit · Won't fit
Model | Params | Q4_K_M | Q5_K_M | Q8_0 | FP16
// How it works

VRAM Math, Explained

Every LLM has billions of parameters (weights). Quantization compresses those weights to use less VRAM, at a small quality cost.

Step 1

Count the Parameters

A "70B" model has 70 billion floating-point numbers. At full precision (FP16), each takes 2 bytes. That's 140 GB just for the weights.

Step 2

Apply Quantization

Quantization reduces precision. Q4_K_M uses ~0.59 bytes/param instead of 2.0. That 70B model drops from 140 GB to ~42 GB.
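The same back-of-the-envelope math at Q4_K_M (a sketch; real GGUF files vary a little because some tensors are kept at higher precision):

```python
params = 70e9
q4_bytes = 0.59      # approximate average bytes per parameter for Q4_K_M
weights_gb = params * q4_bytes / 1e9
print(f"~{weights_gb:.0f} GB")  # -> ~41 GB, in the ballpark of the ~42 GB above
```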

Step 3

Add Overhead

The runtime (Ollama, llama.cpp) needs roughly 1.5 GB of extra VRAM for the KV cache (your context window) and CUDA/Metal buffers. Final estimate: weights + ~1.5 GB.
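Putting the three steps together, a minimal sketch of the estimate this page uses. The bytes-per-param figures and the flat 1.5 GB overhead are this page's assumptions, not values reported by Ollama or llama.cpp:

```python
# Estimated VRAM = parameters x bytes-per-param (by quantization) + flat runtime overhead.
BYTES_PER_PARAM = {"FP16": 2.00, "Q8_0": 1.06, "Q5_K_M": 0.69, "Q4_K_M": 0.59}
OVERHEAD_GB = 1.5  # KV cache + CUDA/Metal buffers (rough assumption)

def estimated_vram_gb(params_billions: float, fmt: str) -> float:
    """Rough VRAM estimate in decimal GB for a dense model at the given quantization."""
    return params_billions * BYTES_PER_PARAM[fmt] + OVERHEAD_GB

def fits(params_billions: float, fmt: str, vram_gb: float) -> bool:
    """True if the estimate fits within the given amount of VRAM."""
    return estimated_vram_gb(params_billions, fmt) <= vram_gb

print(round(estimated_vram_gb(70, "Q4_K_M"), 1))  # 42.8
print(fits(70, "Q4_K_M", 24))                     # False -- too big for a 24 GB card
```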

Format | Bytes/Param | Size vs FP16 | Quality
FP16 | 2.00 | 100% | Baseline
Q8_0 | 1.06 | 53% | Near-lossless
Q5_K_M | 0.69 | 35% | Very good
Q4_K_M | 0.59 | 30% | Good
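For example, plugging an 8B model into the same estimate at each format (decimal GB, using the figures from the table; actual file sizes will differ slightly):

```python
BYTES_PER_PARAM = {"FP16": 2.00, "Q8_0": 1.06, "Q5_K_M": 0.69, "Q4_K_M": 0.59}
for fmt, bpp in BYTES_PER_PARAM.items():
    print(f"{fmt:7s} {8 * bpp + 1.5:5.1f} GB")
# FP16     17.5 GB
# Q8_0     10.0 GB
# Q5_K_M    7.0 GB
# Q4_K_M    6.2 GB
```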

Get weekly model updates — new VRAM data, benchmarks, and setup guides

When a new model drops, we add its VRAM requirements here. Get notified every week so you know before you download 40 GB of weights. 85+ models tracked. Free.

Join 400+ developers. Updated when new models drop. Unsubscribe anytime.