GPU memory is the single biggest bottleneck when running LLMs locally. Buy too little and your model crashes with an out-of-memory error. Buy too much and you've wasted hundreds of dollars on headroom you'll never use.
This article gives you the formula, a quick-reference table for 15 popular models, and a breakdown of what you can realistically run at each VRAM tier. Use the VRAM calculator to get an instant answer for your specific GPU.
Why VRAM Is the Bottleneck
When you load an LLM, every single model weight must fit in GPU memory before inference can begin. Unlike system RAM, VRAM is a fixed pool — the GPU cannot swap weights in and out from disk during a forward pass without grinding to a halt.
Three things compete for that pool:
- Model weights — the largest chunk by far
- KV cache — grows with context length; a 32K context window can rival the weights themselves
- Activations & overhead — typically 1–2 GB for the framework (Ollama, llama.cpp, vLLM)
The VRAM Formula
The math is straightforward:
VRAM (GB) = (Parameters × Bytes per Parameter) + KV Cache + Overhead
Bytes per parameter by precision:
| Precision | Bytes / Param | Notes |
|---|---|---|
| FP32 | 4 bytes | Training and research; almost never used for local inference |
| FP16 / BF16 | 2 bytes | Standard full-precision inference |
| Q8_0 | ~1 byte | 8-bit quantization |
| Q5_K_M | ~0.625 bytes | 5-bit quantization |
| Q4_K_M | ~0.5 bytes | 4-bit quantization; most popular balance |
| Q2_K | ~0.25 bytes | 2-bit; significant quality loss |
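These numbers plug straight into the formula. As a quick sketch (the helper below is illustrative, not part of any tool mentioned here; the bytes-per-parameter values come from the table above):

```python
# Rough VRAM estimate: weights + KV cache + framework overhead.
# 1B params at 1 byte/param is ~1 GB, so params-in-billions × bytes/param ≈ GB.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.625,
    "Q4_K_M": 0.5,
    "Q2_K": 0.25,
}

def estimate_vram_gb(params_b: float, precision: str,
                     kv_cache_gb: float = 0.5, overhead_gb: float = 1.0) -> float:
    """Apply the formula: weights + KV cache + overhead, all in GB."""
    return params_b * BYTES_PER_PARAM[precision] + kv_cache_gb + overhead_gb

print(estimate_vram_gb(7, "Q4_K_M"))  # Mistral 7B at Q4: 3.5 + 0.5 + 1 = 5.0 GB
```

The defaults assume a modest 4K-ish context; longer contexts need a larger `kv_cache_gb`, as discussed in the troubleshooting section.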
Worked example — Llama 4 Scout (109B MoE) at Q4_K_M:
Weights: 109B × 0.5 bytes ≈ 54.5 GB
BUT: Scout is MoE — only 17B params are active at inference. That makes it compute like a 17B model, yet all 109B weights must still be resident in VRAM.
KV cache (4K context): ~0.5 GB
Overhead: ~1 GB
Total: ~56 GB
The MoE caveat is important — we cover it in detail below.
Worked example — Mistral 7B at FP16:
Weights: 7B × 2 bytes = 14 GB
KV cache (4K context): ~0.3 GB
Overhead: ~1 GB
Total: ~15.3 GB → needs 16 GB VRAM
At Q4_K_M: 7B × 0.5 bytes ≈ 3.5 GB + overhead ≈ 5 GB — fits on any modern laptop GPU.
Quick Reference: Top 15 Models
VRAM requirements at three common quantization levels. Numbers include ~1–1.5 GB of framework overhead. Use the VRAM calculator for exact figures on your hardware.
| Model | Params | Q4_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| Llama 3.2 3B | 3B | ~2.5 GB | ~4 GB | ~7 GB |
| Mistral 7B | 7B | ~5 GB | ~8.5 GB | ~15 GB |
| Llama 3.1 8B | 8B | ~5.5 GB | ~9.5 GB | ~17 GB |
| Gemma 3 9B | 9B | ~6 GB | ~10.5 GB | ~19 GB |
| Llama 3.1 70B | 70B | ~40 GB | ~75 GB | ~140 GB |
| Qwen 2.5 7B | 7B | ~5 GB | ~8.5 GB | ~15 GB |
| Qwen 2.5 72B | 72B | ~42 GB | ~78 GB | ~144 GB |
| Phi-4 14B | 14B | ~9 GB | ~15.5 GB | ~29 GB |
| DeepSeek-R1 7B | 7B | ~5 GB | ~8.5 GB | ~15 GB |
| DeepSeek-R1 14B | 14B | ~9 GB | ~15.5 GB | ~29 GB |
| DeepSeek-R1 32B | 32B | ~20 GB | ~35 GB | ~65 GB |
| Llama 4 Scout | 109B (MoE) | ~56 GB | ~110 GB | ~220 GB |
| Mixtral 8x7B | 56B (MoE) | ~26 GB | ~47 GB | ~90 GB |
| Gemma 3 27B | 27B | ~17 GB | ~29 GB | ~55 GB |
| Command-R 35B | 35B | ~22 GB | ~38 GB | ~70 GB |
Budget GPU Guide: What Fits Where
8 GB VRAM (RTX 3060, RX 6600 XT, RTX 4060)
The entry-level tier for serious local LLM use. Covers most popular 7–8B models at Q4. Fits comfortably:
- Mistral 7B Q4_K_M (~5 GB)
- Llama 3.1 8B Q4_K_M (~5.5 GB)
- Qwen 2.5 7B Q4_K_M (~5 GB)
- DeepSeek-R1 7B Q4_K_M (~5 GB)
- Llama 3.2 3B FP16 (~7 GB)
12 GB VRAM (RTX 3060 12GB, RTX 4070, RX 7700 XT)
A significant step up. Now 13–14B models work, and you can run 7B models at Q8 for noticeably better output quality. Fits comfortably:
- Mistral 7B Q8_0 (~8.5 GB)
- Llama 3.1 8B Q8_0 (~9.5 GB)
- Phi-4 14B Q4_K_M (~9 GB)
- DeepSeek-R1 14B Q4_K_M (~9 GB)
16 GB VRAM (RTX 3080, RTX 4080 Super, RX 7900 GRE)
The sweet spot for power users. Unlocks 14–15B models at Q8 and the 32B class at heavy quantization. Fits comfortably:
- Phi-4 14B Q8_0 (~15.5 GB — tight)
- DeepSeek-R1 14B Q8_0 (~15.5 GB — tight)
- Gemma 3 27B Q4_K_M (~17 GB — ~1 GB over; offload a few layers to CPU or drop to Q3_K_M)
24 GB VRAM (RTX 3090, RTX 4090, RTX A5000)
Consumer flagship territory. Most models under 30B run well. DeepSeek-R1 32B Q4 fits. Fits comfortably:
- Gemma 3 27B Q8_0 (~29 GB — tight on 24 GB, better at Q5)
- DeepSeek-R1 32B Q4_K_M (~20 GB)
- Command-R 35B Q4_K_M (~22 GB)
- Mixtral 8x7B Q4_K_M (~26 GB — 1–2 GB over; offload a few layers to CPU)
48 GB VRAM (RTX A6000, RTX 6000 Ada, Dual 3090 NVLink)
Workstation / prosumer tier. Llama 3.1 70B runs at Q4 (~40 GB). Qwen 2.5 72B Q4 (~42 GB) just fits. Fits comfortably:
- Llama 3.1 70B Q4_K_M (~40 GB)
- Qwen 2.5 72B Q4_K_M (~42 GB)
- Mixtral 8x7B FP16 (~90 GB — needs dual 48 GB)
MoE Models: Why the Numbers Look Wrong
Mixture-of-Experts (MoE) models like Mixtral 8x7B and Llama 4 Scout have a confusing parameter count. Mixtral 8x7B has 56 billion total parameters, but only about 13 billion are active on any given forward pass. The router selects 2 of 8 expert networks per token, keeping the rest dormant. What this means for VRAM:
- All weights must still load into VRAM — you need room for all 56B (or 109B for Scout) weight files
- Active computation only touches ~13B (Mixtral) or ~17B (Scout) worth of weights per token
- The result: compute speed feels like a 13B model, but the memory footprint is that of the full parameter count, since every expert stays resident in VRAM
| Model | Total Params | Active Params | VRAM at Q4 | Speed Feel |
|---|---|---|---|---|
| Mixtral 8x7B | 56B | ~13B | ~26 GB | Like a 13B |
| Llama 4 Scout | 109B | 17B | ~56 GB | Like a 17B |
| Llama 4 Maverick | 400B | 17B | ~200 GB | Like a 17B |
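The rule of thumb behind the table (memory tracks total parameters, speed tracks active parameters) can be sketched in a few lines. This is an illustrative helper, using the Q4 figure of ~0.5 bytes per parameter from the formula section:

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bytes_per_param: float = 0.5) -> tuple[float, float]:
    """Return (approx. weight VRAM in GB, dense size it 'feels' like in B params)."""
    # VRAM: every expert must be resident, so memory follows TOTAL params.
    vram_gb = total_params_b * bytes_per_param
    # Speed: only the routed experts run per token, so it computes like
    # a dense model of ACTIVE size.
    return vram_gb, active_params_b

vram, feel = moe_footprint(56, 13)  # Mixtral 8x7B at Q4
print(f"~{vram:.0f} GB of weights, feels like a {feel:.0f}B dense model")
```

The formula gives ~28 GB for Mixtral; real Q4_K_M GGUF files come in a bit lower (~26 GB) because Mixtral's experts share attention layers, so the nominal "8x7B" overcounts the true total.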
Quantization Tradeoffs
Quantization reduces precision to shrink model size. The key question is: how much quality do you actually lose?
| Quantization | VRAM vs FP16 | Quality Loss | Recommendation |
|---|---|---|---|
| FP16 | 1× (baseline) | None | Benchmarks, fine-tuning |
| Q8_0 | ~0.5× | <0.5% | When VRAM allows; near-lossless |
| Q5_K_M | ~0.38× | ~1–2% | Excellent balance for 12–16 GB cards |
| Q4_K_M | ~0.31× | ~3–5% | Default for most users; barely noticeable |
| Q3_K_M | ~0.24× | ~8–12% | Avoid unless VRAM-constrained |
| Q2_K | ~0.19× | >20% | Last resort; noticeable degradation |
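To see what these ratios mean in absolute terms, here is an illustrative calculation for a 70B model (ratios taken from the table above; real GGUF files vary a little because not all tensors are quantized equally):

```python
# Approximate weight size of a 70B model at each quantization level,
# scaling the FP16 baseline by the "VRAM vs FP16" ratios above.
fp16_gb = 70 * 2  # 70B params × 2 bytes/param = 140 GB
ratios = {"Q8_0": 0.5, "Q5_K_M": 0.38, "Q4_K_M": 0.31, "Q3_K_M": 0.24, "Q2_K": 0.19}

for name, ratio in ratios.items():
    print(f"{name}: ~{fp16_gb * ratio:.0f} GB")
# Q4_K_M comes out around 43 GB, in line with the ~40 GB quoted
# for Llama 3.1 70B in the quick-reference table.
```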
For a deeper dive on what these formats mean technically, see the quantization guide.
Check Your GPU — VRAM Calculator
Rather than doing the math manually, use the VRAM calculator. Select your model, pick your GPU, and it tells you instantly whether you can run it — and at what quantization level.
The calculator covers 80+ models and 40+ consumer and workstation GPUs. It also accounts for KV cache growth at different context lengths, which can tip an otherwise-fitting model over the edge. → Open the VRAM Calculator
Troubleshooting VRAM Issues
Out-of-Memory (OOM) Error on Load
The model is too large for your VRAM. Options, in order of preference:
- Reduce the context window: --ctx-size in llama.cpp or num_ctx in Ollama
- Lower --n-gpu-layers in llama.cpp to move some layers to system RAM
Shared VRAM (Integrated Graphics / Laptops)
Many laptop GPUs and all integrated graphics (Intel Iris Xe, AMD Radeon 780M) share a pool of system RAM as VRAM. This "shared VRAM" is much slower than dedicated GDDR memory.
Technically you can run LLMs on shared VRAM, but performance is 3–10× slower than a dedicated GPU. Treat shared VRAM as a proof-of-concept tier, not a daily driver.
On Windows, check Task Manager → Performance → GPU to see your dedicated vs shared memory split.
OOM During Generation (Not on Load)
The model loaded fine but crashes mid-conversation. This is almost always the KV cache growing as context accumulates. Fixes:
- Reduce num_ctx (Ollama) or --ctx-size (llama.cpp) to a smaller value (2048 or 4096)
- Start a new conversation to clear the cache
- Some models have large per-layer KV sizes; check the model card for context limits
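The growth is predictable: for standard attention, the cache stores one K and one V vector per layer, per KV head, per token. As a sketch (the Llama 3.1 8B-style figures used here, 32 layers, 8 KV heads via GQA, head dim 128, are assumptions from the public model card, not numbers from this guide's tables):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Llama 3.1 8B-style config with an FP16 cache:
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.2f} GiB at 8K context")  # → 1.00 GiB; doubles each time context doubles
```

This is why a model that loads fine at 4K context can OOM at 32K: the cache grows linearly with context length while the weights stay fixed.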
GPU Offloading with llama.cpp
If a model is slightly too large for your VRAM, partial GPU offloading splits layers between GPU and CPU:
```shell
# Load 28 of 32 layers on GPU; rest on CPU
llama-cli -m model.gguf --n-gpu-layers 28 -p "Your prompt here"
```

Performance degrades proportionally to how many layers land in CPU memory. 28/32 layers on GPU is fast; 16/32 is noticeably slower; below half is rarely worth it versus running CPU-only.
Checking Available VRAM
```shell
# NVIDIA
nvidia-smi

# AMD
rocm-smi

# macOS (Apple Silicon)
system_profiler SPDisplaysDataType | grep VRAM
```

Run these before loading a model to confirm how much VRAM is actually free (other processes may be using GPU memory).
Summary
- 8 GB VRAM — 7B models at Q4; basic local LLM use
- 12 GB VRAM — 7B at Q8, 13–14B at Q4; solid all-rounder
- 16 GB VRAM — 14B at Q8, 27B at Q4; power user tier
- 24 GB VRAM — 32–35B models; approaching frontier quality
- 48 GB VRAM — 70B models at Q4; near-cloud-quality locally
The formula: parameters × bytes-per-param + KV cache + ~1 GB overhead. MoE models are the exception in speed, not memory — all weights must load, but only a fraction activate per token.
When in doubt, use the VRAM calculator — it does the math for your specific model and GPU combination.