# Why GPU Matters for Local LLMs
Your GPU is the single biggest factor in local LLM performance. More VRAM means bigger models. Faster memory bandwidth means faster token generation. Getting this right saves you hundreds of dollars and hours of frustration.
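The bandwidth claim can be made concrete with back-of-envelope math: during decoding, every weight is read once per generated token, so generation speed is capped by memory bandwidth divided by model size. A minimal sketch (the function name and the figures are illustrative, not measurements):

```python
# Rough ceiling on generation speed: decoding reads all weights once per token,
# so tok/s is bounded by (memory bandwidth) / (model size in memory).
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# e.g., a GPU with 288 GB/s bandwidth running a ~9 GB quantized 13B model:
print(round(max_tokens_per_sec(288, 9)))  # theoretical cap; real-world is lower
```

Real-world throughput lands well below this ceiling (kernel overhead, KV-cache reads), but the ratio explains why bandwidth matters as much as raw compute.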
## The Quick Answer
| Budget | GPU | VRAM | Best For |
|---|---|---|---|
| ~$200 | RTX 4060 | 8GB | 7B models, casual use |
| ~$350 | RTX 4060 Ti 16GB | 16GB | 13B models, daily driver |
| ~$550 | RTX 4070 Ti Super | 16GB | 13B models, fast inference |
| ~$700 | RTX 5070 Ti | 16GB | Best perf/dollar 2026 |
| ~$1000 | RTX 5080 | 16GB | Premium speed, 13B+ |
| ~$1500 | RTX 5090 | 32GB | 70B models, no compromises |
## VRAM: The Most Important Spec
VRAM determines the largest model you can run. Everything else is secondary.
| VRAM | Max Model Size | Examples |
|---|---|---|
| 6GB | 7B (Q4) | Llama 3.1 8B quantized |
| 8GB | 7B-8B (Q5-Q8) | Most 7B models at good quality |
| 12GB | 13B (Q4) | Llama 2 13B, Mistral NeMo 12B |
| 16GB | 13B (Q6-Q8) | High quality 13B inference |
| 24GB | 34B (Q4) or 70B (Q2) | CodeLlama 34B, Mixtral 8x7B |
| 32GB | 70B (Q2-Q3) | Large models, aggressively quantized |
| 48GB+ | 70B (Q4-Q5) or 120B+ | Research-grade, no compromises |
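The table rows follow from simple arithmetic: weights take (parameters × bits per weight / 8) bytes, plus headroom for the KV cache and activations. A sketch of that estimate (the function name and the ~20% overhead factor are assumptions; real overhead depends on context length):

```python
# Estimate VRAM needed for a quantized model.
# overhead=1.2 is an assumed ~20% allowance for KV cache and activations.
def vram_needed_gb(params_billions: float, bits_per_weight: float,
                   overhead: float = 1.2) -> float:
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ≈ 1 GB
    return weight_gb * overhead

# A 13B model at Q4 (~4.5 effective bits/weight for typical GGUF Q4 variants):
print(round(vram_needed_gb(13, 4.5), 1))  # fits comfortably in a 12GB card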
## NVIDIA GPUs (Recommended)
NVIDIA dominates local LLM inference thanks to CUDA support. Every major framework (llama.cpp, vLLM, Ollama) works best with NVIDIA.
### Budget Tier

**RTX 4060 (8GB) — ~$200**

- Runs 7B models at 25-35 tok/s
- Struggles with anything above 8B at good quantization
- Great entry point, but you'll outgrow it fast
### Mid-Range (Best Value)

**RTX 4060 Ti 16GB — ~$350**

- The 16GB version is key (skip the 8GB variant)
- Runs 13B models at Q5 quantization at ~20 tok/s
- Best value if you want to run serious models daily

**RTX 5070 Ti (16GB) — ~$700**

- New in 2026, significant gen-over-gen improvement
- 16GB GDDR7 with much higher bandwidth
- Runs 13B at 35-45 tok/s — noticeably faster than the 4060 Ti
### High-End

**RTX 5080 (16GB) — ~$1000**

- Top-tier for 13B models (50+ tok/s)
- Can run 34B models with aggressive quantization
- Diminishing returns vs the 5070 Ti for most users

**RTX 5090 (32GB) — ~$1500**

- The holy grail for local LLM enthusiasts
- 32GB VRAM runs 70B models at low-bit quantization (Q4 alone needs ~40GB)
- 60+ tok/s on 13B models
- If budget allows, this is the one
### Used Market Gems

**RTX 3090 (24GB) — ~$600-700 used**

- 24GB VRAM is still incredibly useful
- Runs 34B models, and 70B at low quantization
- Best price-to-VRAM ratio on the used market
- Check cooling — these run hot
## AMD GPUs

AMD has improved dramatically for LLM inference via ROCm, but driver issues still crop up. Proceed with caution.

**RX 7900 XTX (24GB) — ~$800**

- 24GB VRAM at a good price
- ROCm support is functional but not as polished as CUDA
- ~80% of equivalent NVIDIA performance in most benchmarks
- Best AMD option if you're committed to the ecosystem

**RX 9070 XT (16GB)**

- New RDNA 4 architecture
- Promising, but LLM software support is still catching up
- Wait for benchmarks before buying for LLM use
## Apple Silicon
If you're on Mac, you don't need a discrete GPU. Apple Silicon uses unified memory, which acts as both RAM and VRAM.
| Chip | Unified Memory | Best Model Size | tok/s (Llama 8B) |
|---|---|---|---|
| M3 | 8-24GB | 7B | ~15 |
| M3 Pro | 18-36GB | 13B | ~20 |
| M3 Max | 36-128GB | 70B | ~25 |
| M4 | 16-32GB | 13B | ~18 |
| M4 Pro | 24-48GB | 34B | ~28 |
| M4 Max | 36-128GB | 70B | ~35 |
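One caveat when reading the memory column: macOS reserves part of unified memory for the system, so not all of it is GPU-addressable. A sketch of the usable amount, assuming the commonly cited ~75% default working-set limit (the function name and fraction are illustrative; the limit can be raised via sysctl):

```python
# Approximate GPU-addressable unified memory on Apple Silicon.
# gpu_fraction=0.75 is an assumed default; macOS caps Metal's working set
# below total RAM, and the exact fraction varies by configuration.
def usable_unified_gb(total_gb: float, gpu_fraction: float = 0.75) -> float:
    return total_gb * gpu_fraction

print(usable_unified_gb(48))  # M4 Pro with 48GB → roughly 36GB for the model
```

That is why a 48GB M4 Pro tops out around 34B models rather than 70B: the model plus KV cache has to fit in the GPU-visible slice, not the full 48GB.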
## Key Metrics Explained
**Tokens per second (tok/s):** How fast the model generates text. 20+ tok/s feels real-time; below 10 is sluggish.

**Memory bandwidth:** Determines how fast data moves to and from VRAM. Higher = faster inference. This is why the RTX 5090 outperforms the 4090 even on models that fit comfortably in both cards.

**Quantization compatibility:** Some GPUs handle certain quantization formats better. NVIDIA is the most flexible here.

## Our Recommendation
**For most people:** RTX 5070 Ti ($700) or used RTX 3090 ($600). Both give you enough VRAM for daily use with 13B+ models.

**On a budget:** RTX 4060 Ti 16GB ($350). The 16GB of VRAM punches above its price class.

**No compromises:** RTX 5090 ($1500). 32GB of VRAM means you won't need to upgrade for years.

**Mac users:** M4 Pro with 48GB ($2400 Mac mini) is the sweet spot; M4 Max if you need 70B models.

Check our Getting Started with Ollama guide to put your new hardware to work.