
How Much VRAM Do You Need for LLMs in 2026?

GPU memory is the single biggest bottleneck for running LLMs locally. Here's the formula, a quick-reference table for the top 15 models, and exactly what you can run on 8 GB, 12 GB, 16 GB, 24 GB, and 48 GB of VRAM.

GPU memory is the single biggest bottleneck when running LLMs locally. Buy too little and your model crashes with an out-of-memory error. Buy too much and you've wasted hundreds of dollars on headroom you'll never use.

This article gives you the formula, a quick-reference table for 15 popular models, and a breakdown of what you can realistically run at each VRAM tier. Use the VRAM calculator to get an instant answer for your specific GPU.

Why VRAM Is the Bottleneck

When you load an LLM, every single model weight must fit in GPU memory before inference can begin. Unlike system RAM, VRAM is a fixed pool — the GPU cannot swap weights in and out from disk during a forward pass without grinding to a halt.

Three things compete for that pool:

  • Model weights: the dominant cost, set by parameter count and precision
  • KV cache: grows linearly with context length
  • Framework overhead: CUDA buffers, activations, and runtime bookkeeping, typically ~1–1.5 GB

Fit everything inside VRAM and inference runs at full GPU speed. Overflow into system RAM and you are limited by PCIe bandwidth — usually 10–40× slower.

The VRAM Formula


The math is straightforward:

VRAM (GB) = (Parameters × Bytes per Parameter) + KV Cache + Overhead
Bytes per parameter by precision:

| Precision | Bytes / Param | Notes |
|---|---|---|
| FP32 | 4 bytes | Training only; never used for inference |
| FP16 / BF16 | 2 bytes | Standard full-precision inference |
| Q8_0 | ~1 byte | 8-bit quantization |
| Q5_K_M | ~0.625 bytes | 5-bit quantization |
| Q4_K_M | ~0.5 bytes | 4-bit quantization; most popular balance |
| Q2_K | ~0.25 bytes | 2-bit; significant quality loss |
Worked example — Llama 4 Scout (109B parameters) at Q4_K_M:

Weights:  109B × 0.5 bytes ≈ 54.5 GB
Note: Scout is MoE — only 17B params are active per token,
but all 109B weights must still be loaded into VRAM.
Sparsity saves compute, not memory.
KV cache (4K context): ~0.5 GB
Overhead: ~1 GB
Total: ~56 GB

The MoE caveat is important — we cover it in detail below.

Worked example — Mistral 7B at FP16:

Weights:  7B × 2 bytes = 14 GB
KV cache (4K context): ~0.3 GB
Overhead: ~1 GB
Total: ~15.3 GB → needs 16 GB VRAM

At Q4_K_M: 7B × 0.5 bytes ≈ 3.5 GB + overhead ≈ 5 GB — fits on any modern laptop GPU.
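The formula above is easy to turn into code. Here is a minimal sketch in Python that reproduces both worked examples; the function name, dictionary, and ~1 GB default overhead are illustrative choices, not from any library:

```python
# Rough VRAM estimate: weights + KV cache + framework overhead.
# Bytes-per-parameter values come from the precision table above;
# the ~1 GB default overhead matches the worked examples.

BYTES_PER_PARAM = {
    "fp16": 2.0,
    "q8_0": 1.0,
    "q5_k_m": 0.625,
    "q4_k_m": 0.5,
    "q2_k": 0.25,
}

def estimate_vram_gb(params_billions: float, quant: str,
                     kv_cache_gb: float = 0.5, overhead_gb: float = 1.0) -> float:
    """Estimate total VRAM in GB for a dense model."""
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return weights_gb + kv_cache_gb + overhead_gb

# Mistral 7B at FP16 with a ~0.3 GB KV cache: about 15.3 GB
print(round(estimate_vram_gb(7, "fp16", kv_cache_gb=0.3), 1))
# Mistral 7B at Q4_K_M: about 5 GB
print(round(estimate_vram_gb(7, "q4_k_m"), 1))
```

For MoE models, feed in the total parameter count, not the active count; every expert must be resident in VRAM.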

Quick Reference: Top 15 Models

VRAM requirements at three common quantization levels. Numbers include ~1–1.5 GB of framework overhead. Use the VRAM calculator for exact figures on your hardware.

| Model | Params | Q4_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| Llama 3.2 3B | 3B | ~2.5 GB | ~4 GB | ~7 GB |
| Mistral 7B | 7B | ~5 GB | ~8.5 GB | ~15 GB |
| Llama 3.1 8B | 8B | ~5.5 GB | ~9.5 GB | ~17 GB |
| Gemma 3 9B | 9B | ~6 GB | ~10.5 GB | ~19 GB |
| Llama 3.1 70B | 70B | ~40 GB | ~75 GB | ~140 GB |
| Qwen 2.5 7B | 7B | ~5 GB | ~8.5 GB | ~15 GB |
| Qwen 2.5 72B | 72B | ~42 GB | ~78 GB | ~144 GB |
| Phi-4 14B | 14B | ~9 GB | ~15.5 GB | ~29 GB |
| DeepSeek-R1 7B | 7B | ~5 GB | ~8.5 GB | ~15 GB |
| DeepSeek-R1 14B | 14B | ~9 GB | ~15.5 GB | ~29 GB |
| DeepSeek-R1 32B | 32B | ~20 GB | ~35 GB | ~65 GB |
| Llama 4 Scout | 109B (MoE) | ~56 GB | ~110 GB | ~220 GB |
| Mixtral 8x7B | 56B (MoE) | ~26 GB | ~47 GB | ~90 GB |
| Gemma 3 27B | 27B | ~17 GB | ~29 GB | ~55 GB |
| Command-R 35B | 35B | ~22 GB | ~38 GB | ~70 GB |
MoE models use fewer active parameters per forward pass — see the MoE section below.

Budget GPU Guide: What Fits Where

8 GB VRAM (RTX 3060, RX 6600 XT, RTX 4060)

The entry-level tier for serious local LLM use. Covers most popular 7–8B models at Q4.

Fits comfortably:

  • Llama 3.2 3B at Q4–Q8: ~2.5–4 GB
  • Mistral 7B Q4_K_M: ~5 GB
  • Qwen 2.5 7B Q4_K_M: ~5 GB
  • DeepSeek-R1 7B Q4_K_M: ~5 GB
  • Llama 3.1 8B Q4_K_M: ~5.5 GB

Does NOT fit: Anything 13B+ at any usable quantization. Phi-4 14B at Q4 (~9 GB) will overflow.

Tip: On 8 GB, aim for models with ≤5 GB VRAM usage to leave headroom for the KV cache as context grows.

12 GB VRAM (RTX 3060 12GB, RTX 4070, RX 7700 XT)

A significant step up. Now 13B models work, and you can run 7B models at Q8 for noticeably better output quality.

Fits comfortably:

  • Gemma 3 9B Q4_K_M: ~6 GB
  • Mistral 7B Q8_0: ~8.5 GB
  • Phi-4 14B Q4_K_M: ~9 GB
  • DeepSeek-R1 14B Q4_K_M: ~9 GB
  • Llama 3.1 8B Q8_0: ~9.5 GB

Fits with tight margin: Gemma 3 9B Q8 (~10.5 GB) — limited context window.

Does NOT fit: 20B+ models at any quantization. Llama 3.1 70B requires multi-GPU at any quant.

16 GB VRAM (RTX 3080, RTX 4080 Super, RX 7900 GRE)

The sweet spot for power users. Unlocks 14–15B models at Q8 and the 32B class at heavy quantization.

Fits comfortably:

  • Mistral 7B FP16: ~15 GB
  • Phi-4 14B Q8_0: ~15.5 GB
  • DeepSeek-R1 14B Q8_0: ~15.5 GB
  • DeepSeek-R1 32B at Q3 or below
  • Any 7–9B model at Q8 with room for long context

Tip: With 16 GB you can also run a heavily quantized Llama 3.1 70B if you offload some layers to CPU — slower but functional.

24 GB VRAM (RTX 3090, RTX 4090, RTX A5000)

Consumer flagship territory. Most models under 30B run well.

Fits comfortably:

  • Gemma 3 27B Q4_K_M: ~17 GB
  • DeepSeek-R1 32B Q4_K_M: ~20 GB
  • Command-R 35B Q4_K_M: ~22 GB
  • Any ≤14B model at Q8 with generous context headroom

Does NOT fit cleanly: Llama 3.1 70B still needs ~40 GB at Q4, which means multi-GPU or heavy CPU offload.

Tip: The RTX 4090 is the best single GPU for local LLMs at any budget. If you are buying new hardware specifically for LLMs, nothing else at this price tier comes close.

48 GB VRAM (RTX A6000, RTX 6000 Ada, Dual 3090 NVLink)

Workstation / prosumer tier. Llama 3.1 70B runs at Q4 (~40 GB). Qwen 2.5 72B Q4 (~42 GB) just fits.

Fits comfortably:

  • DeepSeek-R1 32B Q8_0: ~35 GB
  • Command-R 35B Q8_0: ~38 GB
  • Llama 3.1 70B Q4_K_M: ~40 GB
  • Qwen 2.5 72B Q4_K_M: ~42 GB
  • Mixtral 8x7B Q8_0: ~47 GB (tight)

Tip: Two RTX 3090s with NVLink give you 48 GB combined VRAM for less than the cost of a single A6000. Performance is lower than a true 48 GB GPU but workable for inference.

MoE Models: Why the Numbers Look Wrong

Mixture-of-Experts (MoE) models like Mixtral 8x7B and Llama 4 Scout have a confusing parameter count. Mixtral 8x7B has 56 billion total parameters, but only about 13 billion are active on any given forward pass. The router selects 2 of 8 expert networks per token, keeping the rest dormant. Crucially, every expert must still be resident in VRAM, because you cannot predict which experts the next token will need.

What this means in practice:

| Model | Total Params | Active Params | VRAM at Q4 | Speed Feel |
|---|---|---|---|---|
| Mixtral 8x7B | 56B | ~13B | ~26 GB | Like a 13B |
| Llama 4 Scout | 109B | 17B | ~56 GB | Like a 17B |
| Llama 4 Maverick | 400B | 17B | ~200 GB | Like a 17B |

Llama 4 Scout illustrates the tradeoff: all 109B parameters must be resident (~56 GB at Q4), but each token touches only 17B of them, so it generates at roughly the speed of a 17B dense model. MoE buys you quality per unit of compute, not per unit of VRAM.

Quantization Tradeoffs

Quantization reduces precision to shrink model size. The key question is: how much quality do you actually lose?

| Quantization | VRAM vs FP16 | Quality Loss | Recommendation |
|---|---|---|---|
| FP16 | 1× (baseline) | None | Benchmarks, fine-tuning |
| Q8_0 | ~0.5× | <0.5% | When VRAM allows; near-lossless |
| Q5_K_M | ~0.38× | ~1–2% | Excellent balance for 12–16 GB cards |
| Q4_K_M | ~0.31× | ~3–5% | Default for most users; barely noticeable |
| Q3_K_M | ~0.24× | ~8–12% | Avoid unless VRAM-constrained |
| Q2_K | ~0.19× | >20% | Last resort; noticeable degradation |
For most users, Q4_K_M is the right default. The ~3–5% quality loss versus FP16 is imperceptible in most tasks. Q8_0 is worth it if your GPU has the headroom.

For a deeper dive on what these formats mean technically, see the quantization guide.
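A practical way to use the tradeoff table is to pick the highest-quality quantization that fits your card. A minimal Python sketch; the byte-per-parameter values follow the precision table earlier in the article, and the helper name and ~1 GB overhead are my assumptions:

```python
# Sketch: pick the highest-quality quantization that fits a VRAM budget.
# Values are approximate and assume a dense model plus ~1 GB overhead.

QUANTS = [                 # best quality first
    ("FP16", 2.0),
    ("Q8_0", 1.0),
    ("Q5_K_M", 0.625),
    ("Q4_K_M", 0.5),
    ("Q3_K_M", 0.48),      # roughly 0.24x of FP16, per the tradeoff table
    ("Q2_K", 0.25),
]

def best_quant(params_billions: float, vram_gb: float,
               overhead_gb: float = 1.0):
    """Return the best-fitting quantization name, or None if nothing fits."""
    for name, bytes_per_param in QUANTS:
        if params_billions * bytes_per_param + overhead_gb <= vram_gb:
            return name
    return None

print(best_quant(7, 8))    # Q8_0: a 7B fits at 8-bit on an 8 GB card
print(best_quant(70, 24))  # Q2_K: a 70B fits 24 GB only at extreme quantization
```

Note how the 70B result lands on Q2_K, which the table flags as a last resort; that matches the advice above that a well-quantized smaller model is usually the better choice.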

Check Your GPU — VRAM Calculator

Rather than doing the math manually, use the VRAM calculator. Select your model, pick your GPU, and it tells you instantly whether you can run it — and at what quantization level.

The calculator covers 80+ models and 40+ consumer and workstation GPUs. It also accounts for KV cache growth at different context lengths, which can tip an otherwise-fitting model over the edge.

Troubleshooting VRAM Issues

Out-of-Memory (OOM) Error on Load

The model is too large for your VRAM. Options, in order of preference:

  • Use a smaller quantization — switch from Q8 to Q4, or Q4 to Q3
  • Reduce context length — lower --ctx-size in llama.cpp or num_ctx in Ollama
  • Offload some layers to CPU — lower --n-gpu-layers in llama.cpp so the remainder stays in system RAM
  • Use a smaller model — a well-quantized 7B often outperforms a heavily quantized 13B

Shared VRAM (Integrated Graphics / Laptops)

Many laptop GPUs and all integrated graphics (Intel Iris Xe, AMD Radeon 780M) share a pool of system RAM as VRAM. This "shared VRAM" is much slower than dedicated GDDR memory.

Technically you can run LLMs on shared VRAM, but performance is 3–10× slower than a dedicated GPU. Treat shared VRAM as a proof-of-concept tier, not a daily driver.

On Windows, check Task Manager → Performance → GPU to see your dedicated vs shared memory split.

OOM During Generation (Not on Load)

The model loaded fine but crashes mid-conversation. This is almost always the KV cache growing as context accumulates. Fixes:

  • Lower the context limit — reduce --ctx-size in llama.cpp or num_ctx in Ollama
  • Trim or summarize conversation history before it fills the window
  • Drop to a smaller quantization so more VRAM is free for the cache
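The growth behind this failure mode is easy to estimate. A minimal Python sketch; the helper name is mine, and the config numbers are Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128) with an FP16 cache:

```python
# KV cache size: 2 (keys and values) x layers x KV heads x head_dim
# x tokens x bytes per element.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB (FP16 entries by default)."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem
    return total_bytes / 1024**3

print(kv_cache_gb(32, 8, 128, 4096))    # 0.5 GB at 4K context
print(kv_cache_gb(32, 8, 128, 32768))   # 4.0 GB at 32K context
```

Going from 4K to 32K context multiplies the cache by eight, which is exactly how a model that loaded fine runs out of memory an hour into a conversation.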

GPU Offloading with llama.cpp

If a model is slightly too large for your VRAM, partial GPU offloading splits layers between GPU and CPU:

# Load 28 of 32 layers on GPU; rest on CPU
llama-cli -m model.gguf --n-gpu-layers 28 -p "Your prompt here"

Performance degrades proportionally to how many layers land in CPU memory. 28/32 layers on GPU is fast; 16/32 is noticeably slower; below half is rarely worth it versus running CPU-only.

Checking Available VRAM

# NVIDIA
nvidia-smi

# AMD
rocm-smi

# macOS (Apple Silicon)
system_profiler SPDisplaysDataType | grep VRAM

Run these before loading a model to confirm how much VRAM is actually free (other processes may be using GPU memory).

Summary

The formula: Parameters × bytes-per-param + KV cache + ~1 GB overhead. MoE models are the special case: all weights must load, but only a fraction activate per token, so they generate faster than their size suggests.

When in doubt, use the VRAM calculator — it does the math for your specific model and GPU combination.
