GPU memory is the single biggest bottleneck when running LLMs locally. Buy too little and your model crashes with an out-of-memory error. Buy too much and you've wasted hundreds of dollars on headroom you'll never use.
This article gives you the formula, a quick-reference table for 15 popular models, and a breakdown of what you can realistically run at each VRAM tier. Use the VRAM calculator to get an instant answer for your specific GPU.
Why VRAM Is the Bottleneck
When you load an LLM, every single model weight must fit in GPU memory before inference can begin. Unlike system RAM, VRAM is a fixed pool — the GPU cannot swap weights in and out from disk during a forward pass without grinding to a halt.
Three things compete for that pool:
- Model weights — the largest chunk by far
- KV cache — grows with context length; a 32K context window can rival the weights themselves
- Activations & overhead — typically 1–2 GB for the framework (Ollama, llama.cpp, vLLM)
The VRAM Formula
The math is straightforward:
VRAM (GB) = (Parameters × Bytes per Parameter) + KV Cache + Overhead
Bytes per parameter by precision:
| Precision | Bytes / Param | Notes |
|---|---|---|
| FP32 | 4 bytes | Training and research; almost never used for local inference |
| FP16 / BF16 | 2 bytes | Standard full-precision inference |
| Q8_0 | ~1 byte | 8-bit quantization |
| Q5_K_M | ~0.625 bytes | 5-bit quantization |
| Q4_K_M | ~0.5 bytes | 4-bit quantization; most popular balance |
| Q2_K | ~0.25 bytes | 2-bit; significant quality loss |
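These numbers plug straight into the formula. As a quick sketch (the helper below is illustrative, not part of any tool mentioned here; the bytes-per-parameter values come from the table above):

```python
# Rough VRAM estimate: weights + KV cache + framework overhead.
# 1B params at 1 byte/param is ~1 GB, so params-in-billions × bytes/param ≈ GB.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.625,
    "Q4_K_M": 0.5,
    "Q2_K": 0.25,
}

def estimate_vram_gb(params_b: float, precision: str,
                     kv_cache_gb: float = 0.5, overhead_gb: float = 1.0) -> float:
    """Apply the formula: weights + KV cache + overhead, all in GB."""
    return params_b * BYTES_PER_PARAM[precision] + kv_cache_gb + overhead_gb

print(estimate_vram_gb(7, "Q4_K_M"))  # Mistral 7B at Q4: 3.5 + 0.5 + 1 = 5.0 GB
```

The defaults assume a modest 4K-ish context; longer contexts need a larger `kv_cache_gb`, as discussed in the troubleshooting section.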
Worked example — Llama 4 Scout (109B MoE) at Q4_K_M:
Weights: 109B × 0.5 bytes ≈ 54.5 GB
BUT: Scout is MoE — only 17B params are active at inference. That makes it compute like a 17B model, yet all 109B weights must still be resident in VRAM.
KV cache (4K context): ~0.5 GB
Overhead: ~1 GB
Total: ~56 GB
The MoE caveat is important — we cover it in detail below.
Worked example — Mistral 7B at FP16:
Weights: 7B × 2 bytes = 14 GB
KV cache (4K context): ~0.3 GB
Overhead: ~1 GB
Total: ~15.3 GB → needs 16 GB VRAM
At Q4_K_M: 7B × 0.5 bytes ≈ 3.5 GB + overhead ≈ 5 GB — fits on any modern laptop GPU.
Quick Reference: Top 15 Models
VRAM requirements at three common quantization levels. Numbers include ~1–1.5 GB of framework overhead. Use the VRAM calculator for exact figures on your hardware.
| Model | Params | Q4_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| Llama 3.2 3B | 3B | ~2.5 GB | ~4 GB | ~7 GB |
| Mistral 7B | 7B | ~5 GB | ~8.5 GB | ~15 GB |
| Llama 3.1 8B | 8B | ~5.5 GB | ~9.5 GB | ~17 GB |
| Gemma 3 9B | 9B | ~6 GB | ~10.5 GB | ~19 GB |
| Llama 3.1 70B | 70B | ~40 GB | ~75 GB | ~140 GB |
| Qwen 2.5 7B | 7B | ~5 GB | ~8.5 GB | ~15 GB |
| Qwen 2.5 72B | 72B | ~42 GB | ~78 GB | ~144 GB |
| Phi-4 14B | 14B | ~9 GB | ~15.5 GB | ~29 GB |
| DeepSeek-R1 7B | 7B | ~5 GB | ~8.5 GB | ~15 GB |
| DeepSeek-R1 14B | 14B | ~9 GB | ~15.5 GB | ~29 GB |
| DeepSeek-R1 32B | 32B | ~20 GB | ~35 GB | ~65 GB |
| Llama 4 Scout | 109B (MoE) | ~56 GB | ~110 GB | ~220 GB |
| Mixtral 8x7B | 56B (MoE) | ~26 GB | ~47 GB | ~90 GB |
| Gemma 3 27B | 27B | ~17 GB | ~29 GB | ~55 GB |
| Command-R 35B | 35B | ~22 GB | ~38 GB | ~70 GB |
Budget GPU Guide: What Fits Where
8 GB VRAM (RTX 3060, RX 6600 XT, RTX 4060)
The entry-level tier for serious local LLM use. Covers most popular 7–8B models at Q4. Fits comfortably:
- Mistral 7B Q4_K_M (~5 GB)
- Llama 3.1 8B Q4_K_M (~5.5 GB)
- Qwen 2.5 7B Q4_K_M (~5 GB)
- DeepSeek-R1 7B Q4_K_M (~5 GB)
- Llama 3.2 3B FP16 (~7 GB)
12 GB VRAM (RTX 3060 12GB, RTX 4070, RX 7700 XT)
A significant step up. Now 13–14B models work, and you can run 7B models at Q8 for noticeably better output quality. Fits comfortably:
- Mistral 7B Q8_0 (~8.5 GB)
- Llama 3.1 8B Q8_0 (~9.5 GB)
- Phi-4 14B Q4_K_M (~9 GB)
- DeepSeek-R1 14B Q4_K_M (~9 GB)
16 GB VRAM (RTX 3080, RTX 4080 Super, RX 7900 GRE)
The sweet spot for power users. Unlocks 14–15B models at Q8 and the 32B class at heavy quantization. Fits comfortably:
- Phi-4 14B Q8_0 (~15.5 GB — tight)
- DeepSeek-R1 14B Q8_0 (~15.5 GB — tight)
- Gemma 3 27B Q4_K_M (~17 GB — ~1 GB over; offload a few layers to CPU or drop to Q3_K_M)
24 GB VRAM (RTX 3090, RTX 4090, RTX A5000)
Consumer flagship territory. Most models under 30B run well. DeepSeek-R1 32B Q4 fits. Fits comfortably:
- Gemma 3 27B Q8_0 (~29 GB — tight on 24 GB, better at Q5)
- DeepSeek-R1 32B Q4_K_M (~20 GB)
- Command-R 35B Q4_K_M (~22 GB)
- Mixtral 8x7B Q4_K_M (~26 GB — 1–2 GB over; offload a few layers to CPU)
48 GB VRAM (RTX A6000, RTX 6000 Ada, Dual 3090 NVLink)
Workstation / prosumer tier. Llama 3.1 70B runs at Q4 (~40 GB). Qwen 2.5 72B Q4 (~42 GB) just fits. Fits comfortably:
- Llama 3.1 70B Q4_K_M (~40 GB)
- Qwen 2.5 72B Q4_K_M (~42 GB)
- Mixtral 8x7B FP16 (~90 GB — needs dual 48 GB)
MoE Models: Why the Numbers Look Wrong
Mixture-of-Experts (MoE) models like Mixtral 8x7B and Llama 4 Scout have a confusing parameter count. Mixtral 8x7B has 56 billion total parameters, but only about 13 billion are active on any given forward pass. The router selects 2 of 8 expert networks per token, keeping the rest dormant. What this means for VRAM:
- All weights must still load into VRAM — you need room for all 56B (or 109B for Scout) weight files
- Active computation only touches ~13B (Mixtral) or ~17B (Scout) worth of weights per token
- The result: compute speed feels like a 13B model, but the memory footprint is that of the full parameter count, since every expert stays resident in VRAM
| Model | Total Params | Active Params | VRAM at Q4 | Speed Feel |
|---|---|---|---|---|
| Mixtral 8x7B | 56B | ~13B | ~26 GB | Like a 13B |
| Llama 4 Scout | 109B | 17B | ~56 GB | Like a 17B |
| Llama 4 Maverick | 400B | 17B | ~200 GB | Like a 17B |
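The rule of thumb behind the table (memory tracks total parameters, speed tracks active parameters) can be sketched in a few lines. This is an illustrative helper, using the Q4 figure of ~0.5 bytes per parameter from the formula section:

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bytes_per_param: float = 0.5) -> tuple[float, float]:
    """Return (approx. weight VRAM in GB, dense size it 'feels' like in B params)."""
    # VRAM: every expert must be resident, so memory follows TOTAL params.
    vram_gb = total_params_b * bytes_per_param
    # Speed: only the routed experts run per token, so it computes like
    # a dense model of ACTIVE size.
    return vram_gb, active_params_b

vram, feel = moe_footprint(56, 13)  # Mixtral 8x7B at Q4
print(f"~{vram:.0f} GB of weights, feels like a {feel:.0f}B dense model")
```

The formula gives ~28 GB for Mixtral; real Q4_K_M GGUF files come in a bit lower (~26 GB) because Mixtral's experts share attention layers, so the nominal "8x7B" overcounts the true total.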
Quantization Tradeoffs
Quantization reduces precision to shrink model size. The key question is: how much quality do you actually lose?
| Quantization | VRAM vs FP16 | Quality Loss | Recommendation |
|---|---|---|---|
| FP16 | 1× (baseline) | None | Benchmarks, fine-tuning |
| Q8_0 | ~0.5× | <0.5% | When VRAM allows; near-lossless |
| Q5_K_M | ~0.38× | ~1–2% | Excellent balance for 12–16 GB cards |
| Q4_K_M | ~0.31× | ~3–5% | Default for most users; barely noticeable |
| Q3_K_M | ~0.24× | ~8–12% | Avoid unless VRAM-constrained |
| Q2_K | ~0.19× | >20% | Last resort; noticeable degradation |
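To see what these ratios mean in absolute terms, here is an illustrative calculation for a 70B model (ratios taken from the table above; real GGUF files vary a little because not all tensors are quantized equally):

```python
# Approximate weight size of a 70B model at each quantization level,
# scaling the FP16 baseline by the "VRAM vs FP16" ratios above.
fp16_gb = 70 * 2  # 70B params × 2 bytes/param = 140 GB
ratios = {"Q8_0": 0.5, "Q5_K_M": 0.38, "Q4_K_M": 0.31, "Q3_K_M": 0.24, "Q2_K": 0.19}

for name, ratio in ratios.items():
    print(f"{name}: ~{fp16_gb * ratio:.0f} GB")
# Q4_K_M comes out around 43 GB, in line with the ~40 GB quoted
# for Llama 3.1 70B in the quick-reference table.
```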
For a deeper dive on what these formats mean technically, see the quantization guide.
Check Your GPU — VRAM Calculator
Rather than doing the math manually, use the VRAM calculator. Select your model, pick your GPU, and it tells you instantly whether you can run it — and at what quantization level.
The calculator covers 80+ models and 40+ consumer and workstation GPUs. It also accounts for KV cache growth at different context lengths, which can tip an otherwise-fitting model over the edge. → Open the VRAM Calculator
Troubleshooting VRAM Issues
Out-of-Memory (OOM) Error on Load
The model is too large for your VRAM. Options, in order of preference:
- Reduce the context window: --ctx-size in llama.cpp or num_ctx in Ollama
- Lower --n-gpu-layers in llama.cpp to move some layers to system RAM
Shared VRAM (Integrated Graphics / Laptops)
Many laptop GPUs and all integrated graphics (Intel Iris Xe, AMD Radeon 780M) share a pool of system RAM as VRAM. This "shared VRAM" is much slower than dedicated GDDR memory.
Technically you can run LLMs on shared VRAM, but performance is 3–10× slower than a dedicated GPU. Treat shared VRAM as a proof-of-concept tier, not a daily driver.
On Windows, check Task Manager → Performance → GPU to see your dedicated vs shared memory split.
OOM During Generation (Not on Load)
The model loaded fine but crashes mid-conversation. This is almost always the KV cache growing as context accumulates. Fixes:
- Reduce num_ctx (Ollama) or --ctx-size (llama.cpp) to a smaller value (2048 or 4096)
- Start a new conversation to clear the cache
- Some models have large per-layer KV sizes; check the model card for context limits
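The growth is predictable: for standard attention, the cache stores one K and one V vector per layer, per KV head, per token. As a sketch (the Llama 3.1 8B-style figures used here, 32 layers, 8 KV heads via GQA, head dim 128, are assumptions from the public model card, not numbers from this guide's tables):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Llama 3.1 8B-style config with an FP16 cache:
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.2f} GiB at 8K context")  # → 1.00 GiB; doubles each time context doubles
```

This is why a model that loads fine at 4K context can OOM at 32K: the cache grows linearly with context length while the weights stay fixed.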
GPU Offloading with llama.cpp
If a model is slightly too large for your VRAM, partial GPU offloading splits layers between GPU and CPU:
```shell
# Load 28 of 32 layers on GPU; rest on CPU
llama-cli -m model.gguf --n-gpu-layers 28 -p "Your prompt here"
```

Performance degrades proportionally to how many layers land in CPU memory. 28/32 layers on GPU is fast; 16/32 is noticeably slower; below half is rarely worth it versus running CPU-only.
Checking Available VRAM
```shell
# NVIDIA
nvidia-smi

# AMD
rocm-smi

# macOS (Apple Silicon)
system_profiler SPDisplaysDataType | grep VRAM
```

Run these before loading a model to confirm how much VRAM is actually free (other processes may be using GPU memory).
Summary
- 8 GB VRAM — 7B models at Q4; basic local LLM use
- 12 GB VRAM — 7B at Q8, 13–14B at Q4; solid all-rounder
- 16 GB VRAM — 14B at Q8, 27B at Q4; power user tier
- 24 GB VRAM — 32–35B models; approaching frontier quality
- 48 GB VRAM — 70B models at Q4; near-cloud-quality locally
The formula: parameters × bytes-per-param + KV cache + ~1 GB overhead. MoE models are the exception in speed, not memory — all weights must load, but only a fraction activate per token.
When in doubt, use the VRAM calculator — it does the math for your specific model and GPU combination.