Most People Overspend on GPU
The most common mistake when buying a GPU for local LLMs is optimizing for gaming benchmarks instead of inference performance. For local LLMs, VRAM capacity is the number that matters most. A $300 GPU with 24GB of VRAM will outperform a $700 GPU with 12GB for running large models. This guide cuts through the marketing noise and tells you exactly what to buy at each budget.
Why VRAM Matters More Than TFLOPS
Traditional GPU benchmarks measure TFLOPS (floating-point operations per second) — relevant for gaming, rendering, and training. For LLM inference, the model must fit entirely in VRAM.
If the model doesn't fit in VRAM, it overflows to system RAM (or disk), which is 10-50x slower. A GPU with half the TFLOPS but double the VRAM will generate tokens dramatically faster on models that only fit in the larger card. The key insight: inference isn't compute-heavy per token. Each generated token streams the entire set of weights through the GPU, so memory bandwidth and VRAM capacity dominate inference speed, not raw compute.
VRAM Requirements by Model Size
| Model Size | Q4 VRAM | Q5 VRAM | Q8 VRAM | FP16 VRAM |
|---|---|---|---|---|
| 3B params | ~2GB | ~2.5GB | ~3.5GB | ~6GB |
| 7-8B params | ~4.5GB | ~6GB | ~8GB | ~16GB |
| 13-14B params | ~8GB | ~10GB | ~14GB | ~28GB |
| 30-34B params | ~17GB | ~21GB | ~34GB | ~68GB |
| 70-72B params | ~38GB | ~48GB | ~72GB | ~144GB |
| 109B params | ~55GB | ~68GB | ~109GB | ~218GB |
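The table above follows a simple rule of thumb: VRAM for weights ≈ parameter count × bytes per weight. A minimal sketch of that estimate (the bits-per-weight figures are approximations for common GGUF quant variants, not exact values, and a real deployment needs an extra 1-2GB for KV cache and runtime overhead):

```python
# Rough VRAM estimate for quantized model weights.
# Bits-per-weight values are approximations: e.g. Q4_K_M averages
# ~4.5 bits per weight, not exactly 4, because some tensors stay
# at higher precision.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q8": 8.5, "FP16": 16.0}

def estimate_vram_gb(params_billions: float, quant: str) -> float:
    """Weights only; budget another 1-2 GB for KV cache and overhead."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8  # GB: 1B params at 1 byte/weight = 1 GB

print(round(estimate_vram_gb(8, "Q4"), 1))  # 4.5, matching the table's ~4.5GB row
```

Running the same function over the 70B row gives ~39GB at Q4 and 140GB at FP16, in line with the table's ~38GB and ~144GB figures.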
Memory Bandwidth: The Hidden Performance Factor
Once the model fits, memory bandwidth determines how fast tokens generate. More bandwidth = more tokens per second.
| GPU | VRAM | Bandwidth | Est. Speed (8B Q4) |
|---|---|---|---|
| RTX 3060 12GB | 12GB | 360 GB/s | ~18-22 tok/s |
| RTX 4060 Ti 16GB | 16GB | 288 GB/s | ~16-20 tok/s |
| RTX 3090 24GB | 24GB | 936 GB/s | ~35-45 tok/s |
| RTX 4090 24GB | 24GB | 1,008 GB/s | ~55-65 tok/s |
| RTX 5090 32GB | 32GB | 1,792 GB/s | ~80-100 tok/s |
| M4 Max 128GB | 128GB | 546 GB/s | ~40-50 tok/s |
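Those speed estimates fall out of a simple ceiling: every generated token reads the full set of weights once, so tokens per second can't exceed bandwidth divided by model size. A back-of-the-envelope sketch (real throughput lands well below this ceiling because of compute, KV-cache reads, and scheduling overhead):

```python
def tokens_per_sec_ceiling(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Upper bound on decode speed: one full pass over the weights per token."""
    return bandwidth_gbs / model_size_gb

# RTX 3090 running an 8B model at Q4 (~4.5GB of weights):
print(round(tokens_per_sec_ceiling(936, 4.5)))  # 208 theoretical tok/s
```

The observed ~35-45 tok/s is a fraction of that 208 tok/s bound, but the ranking the bound predicts (more bandwidth, or a smaller quantized model, means faster generation) matches the table.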
---
GPU Recommendations by Budget
Under $300 — Entry Level
Best pick: Used RTX 3060 12GB (~$200-250)

The RTX 3060 12GB is a sleeper hit for local LLMs. Despite being a mid-range 2021 gaming GPU, its 12GB of VRAM comfortably fits 8B models at Q8 and 13B models at Q4. Used units are widely available for $200-250.

What you can run:
- Llama 4 Vega 8B at Q8 (8GB) — excellent quality
- Qwen3 14B at Q4 (~8GB) — strong reasoning
- DeepSeek V3.2 7B at Q8 (~7GB) — great for coding
- Phi-4 Mini (3.8B) at FP16 (~8GB) — Microsoft's tiny powerhouse
What you can't run:
- 30B+ models (need 16GB+)
- 70B models (need 38GB+ at Q4)
---
$300–$600 — Mid-Range
Best pick: Used RTX 3090 24GB (~$400-500)

This is the single best value GPU for local LLMs in 2026. The RTX 3090's 24GB VRAM runs 30B models at Q4 and 70B models at Q2. Its 936 GB/s bandwidth makes it fast despite being a 2020 GPU.

What you can run:
- Any 8B model at FP16 (16GB) — reference quality
- Llama 4 Scout 109B at Q2 (~28GB — needs dual 3090 or extra RAM offload)
- Qwen3 30B at Q5 (~21GB) — excellent quality
- Kimi K2.5 72B at Q2 (~19GB) — powerful but degraded quality
- DeepSeek V3.2 235B at Q2 (~60GB — needs offloading)
Alternative: RTX 4060 Ti 16GB (~$380 new)
- 16GB is tight for 30B models at Q4 (~17GB; a few layers must offload to system RAM)
- Slower than the RTX 3090 at LLM inference due to lower bandwidth
- Better power efficiency (165W vs 350W)
- Good choice if electricity cost matters or if you're on a small form factor build
$600–$1,200 — Performance Tier
Best pick: RTX 5070 Ti 16GB (~$750)

The RTX 5070 Ti brings next-gen bandwidth (896 GB/s) and 16GB GDDR7 VRAM. It's significantly faster than the RTX 4060 Ti 16GB at LLM inference and beats the RTX 3090 in speed while using 40% less power.

What you can run:
- All 8B models at FP16
- 13B models at Q8 (~14GB)
- 30B models at Q4 (~17GB — fits with tight context)
- Fast: ~50-60 tok/s on 8B Q4 models
---
$1,200–$2,000 — Enthusiast Tier
Best pick: RTX 4090 24GB (~$1,400 new, ~$1,100 used)

The RTX 4090 is still the consumer-GPU gold standard for local LLMs. 24GB VRAM plus 1,008 GB/s bandwidth handles 30B models comfortably and 70B at Q2 squeezed. The used price has dropped significantly in 2026 due to RTX 5090 availability.

What you can run:
- Any 8B model at FP16
- Qwen3 30B at Q5 (~21GB)
- Kimi K2.5 72B at Q2 (~19GB)
- DeepSeek V3.2 7B at FP16
- ~55-65 tok/s on 8B Q4 models
---
$2,000+ — High-End
Best pick: RTX 5090 32GB (~$2,000-2,200)

The RTX 5090 is the fastest consumer GPU ever made. Its 32GB GDDR7 VRAM and 1,792 GB/s bandwidth make it genuinely competitive with professional-grade cards. If you're running 70B models or multiple users, this is the endgame consumer option.

What you can run:
- Kimi K2.5 72B at Q3_K_M (~27GB, with headroom; Q4 at ~36GB doesn't fit in 32GB)
- Llama 4 Scout 109B at Q2 (~28GB)
- Any 30B model at Q8 (~34GB)
- DeepSeek V3.2 235B with partial offloading
- ~80-100 tok/s on 8B Q4 models
---
Full GPU Comparison Table
| GPU | VRAM | Bandwidth | Price (March 2026) | Best For |
|---|---|---|---|---|
| RTX 3060 12GB | 12GB | 360 GB/s | ~$220 (used) | 8B-13B models on a budget |
| Intel Arc B580 12GB | 12GB | 456 GB/s | ~$250 (new) | 8B-13B, Linux-friendly |
| RTX 4060 Ti 16GB | 16GB | 288 GB/s | ~$380 (new) | 13B-30B, power-efficient |
| RTX 3090 24GB | 24GB | 936 GB/s | ~$450 (used) | Best value — 30B models |
| RTX 5070 Ti 16GB | 16GB | 896 GB/s | ~$750 (new) | Fast inference, 13B-30B |
| RTX 4090 24GB | 24GB | 1,008 GB/s | ~$1,200 (used) | 30B+ models, top consumer |
| RTX 5090 32GB | 32GB | 1,792 GB/s | ~$2,100 (new) | 70B models, multi-user |
---
Multi-GPU Setups: When 2× RTX 3090 Beats 1× RTX 4090
Two GPUs pool their VRAM, letting you run models that are too large for a single card.
2× RTX 3090 (48GB total) vs 1× RTX 4090 (24GB)
| Metric | 2× RTX 3090 | 1× RTX 4090 |
|---|---|---|
| Total VRAM | 48GB | 24GB |
| Bandwidth | 2× 936 GB/s | 1× 1,008 GB/s |
| Models that fit (Q4) | Up to 72B | Up to 34B |
| Power draw | ~700W | ~350W |
| Hardware cost | ~$900 (used pair) | ~$1,200 |
| Complexity | PCIe bandwidth split | Single GPU, simpler |
2× RTX 3090 Setup Requirements
- Motherboard with 2 full-length PCIe x16 slots
- 1000W+ PSU (each 3090 draws ~350W peak)
- Good case airflow — these cards run hot
- Ollama and llama.cpp support multi-GPU via `CUDA_VISIBLE_DEVICES`

```shell
# Run model across 2 GPUs with llama.cpp
llama-server -m model.gguf --n-gpu-layers 99 --split-mode row

# Or with Ollama (auto-detects multiple GPUs)
ollama run llama4-scout
```
When Multi-GPU Makes Sense
✅ Use multi-GPU if:
- You want to run 70B models at Q4 quality
- You already own one RTX 3090 and want to expand
- You're running a local inference server for multiple users
❌ Skip multi-GPU if:
- You just want to chat with 8B-30B models
- Power efficiency matters (a second card doubles power draw)
- Your motherboard only has one PCIe slot
Apple Silicon: M2 Ultra, M3 Max, M4 Max
Apple Silicon uses unified memory: system RAM and GPU memory are the same pool. An M4 Max with 128GB of RAM can make most of that pool available for LLM weights (macOS reserves a slice for the system), with no separate VRAM limit.
Apple Silicon Comparison for LLMs
| Chip | Max Unified Memory | Bandwidth | Est. Speed (8B Q4) | Best For |
|---|---|---|---|---|
| M2 Ultra | 192GB | 800 GB/s | ~38-45 tok/s | 109B models at Q5+ |
| M3 Max | 128GB | 400 GB/s | ~35-42 tok/s | 70B models at Q5+ |
| M4 Max | 128GB | 546 GB/s | ~40-50 tok/s | 70B models at Q5+ |
| M4 Ultra | 192GB | 819 GB/s | ~50-60 tok/s | 109B+ models |
Why Apple Silicon Is Different
Advantages:
- Massive effective VRAM: 64-192GB depending on configuration
- Silent, efficient: 60-80W for the whole machine vs 350W+ for an RTX 4090
- Great for large models: Run Llama 4 Scout 109B at Q5 quality on an M4 Max
- MLX framework: Apple's MLX library is optimized for Apple Silicon inference, often faster than llama.cpp

Disadvantages:
- Bandwidth ceiling: Even the M4 Max at 546 GB/s is slower than the RTX 5090 at 1,792 GB/s for small models
- Price: An M4 Max MacBook Pro starts at ~$2,500. A Mac Pro with M4 Ultra is ~$7,000+
- Locked ecosystem: Can't upgrade memory or add a second GPU
MLX: Apple's LLM Framework
Use MLX for best performance on Apple Silicon:
```shell
# Install MLX and the community model runner
pip install mlx-lm

# Run a 70B model on M4 Max
mlx_lm.generate --model mlx-community/Qwen3-72B-4bit --prompt "Explain transformers"

# Run interactively
mlx_lm.chat --model mlx-community/Kimi-K2.5-72B-Instruct-4bit
```

MLX models are available pre-converted at huggingface.co/mlx-community.
Apple Silicon vs NVIDIA: When to Choose Each
Choose Apple Silicon if:
- You want a laptop that runs 70B models
- Silent operation matters
- You're already in the Mac ecosystem
- You want to run very large models (109B+) with high quantization quality

Choose NVIDIA if:
- You're building a desktop workstation
- You want the fastest tokens per second on 8B-30B models
- You need CUDA compatibility for training or development
- Budget is your primary constraint
Cloud GPU Alternatives: When Renting Beats Buying
Buying a GPU requires upfront capital. Renting cloud GPUs costs more per hour but starts immediately.
Cloud GPU Pricing (March 2026)
| Provider | GPU | Price/Hour | Notes |
|---|---|---|---|
| RunPod | RTX 4090 24GB | ~$0.35/hr | Spot pricing, occasionally unavailable |
| RunPod | RTX 3090 24GB | ~$0.25/hr | Reliable availability |
| RunPod | H100 80GB | ~$2.50/hr | Professional, high throughput |
| Lambda Labs | A100 40GB | ~$1.10/hr | On-demand, stable |
| Vast.ai | RTX 4090 | ~$0.30/hr | Peer-to-peer, variable reliability |
| AWS g5.xlarge | A10G 24GB | ~$1.01/hr | Enterprise SLA, expensive |
Breakeven Math
At what point does buying a GPU beat renting?

Scenario: RTX 3090 24GB
- Purchase price: $450 (used)
- Cloud equivalent: RunPod RTX 3090 at $0.25/hr
- Break-even: 450 ÷ 0.25 = 1,800 hours of use (~75 days of 24/7 usage)
Scenario: RTX 4090 24GB
- Purchase price: $1,200 (used)
- Cloud equivalent: RunPod RTX 4090 at $0.35/hr
- Break-even: 1,200 ÷ 0.35 = ~3,430 hours (~143 days of 24/7 usage)
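The same arithmetic works for any card, with an optional electricity term that pushes the break-even point out a bit. A small sketch (the 350W draw and $0.15/kWh rate below are illustrative assumptions, not figures from this guide):

```python
def breakeven_hours(purchase_usd: float, cloud_rate_per_hr: float,
                    watts: float = 0, usd_per_kwh: float = 0.15) -> float:
    """Hours of use at which buying beats renting.
    Local electricity cost shrinks the effective hourly saving."""
    power_cost_per_hr = watts / 1000 * usd_per_kwh
    return purchase_usd / (cloud_rate_per_hr - power_cost_per_hr)

print(round(breakeven_hours(450, 0.25)))       # 1800, matching the RTX 3090 math
print(round(breakeven_hours(450, 0.25, 350)))  # 2278, once electricity is counted
```

At two hours of use per day, even the electricity-adjusted figure is roughly three years of use, which is why daily users come out ahead buying.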
Rent vs Buy Decision
✅ Rent cloud GPU if:
- You use AI occasionally (< 1 hour/day)
- You want to try 70B+ models before committing to a hardware purchase
- You need burst capacity for a specific project
- You don't want to deal with hardware setup
✅ Buy a GPU if:
- You use AI daily (2+ hours/day)
- Privacy matters and you don't want data on third-party infrastructure
- You want zero marginal cost per query
- You're building a local inference server for a team
Quick Decision Guide
| Your Use Case | Recommendation | Budget |
|---|---|---|
| Test local LLMs, 8B models | RTX 3060 12GB (used) | ~$220 |
| Daily driver, 13B models | RTX 4060 Ti 16GB | ~$380 |
| Best value, 30B models | RTX 3090 24GB (used) | ~$450 |
| Fast inference, power-efficient | RTX 5070 Ti 16GB | ~$750 |
| Top consumer GPU, 30B+ | RTX 4090 24GB | ~$1,200 |
| 70B models, max performance | RTX 5090 32GB | ~$2,100 |
| 70B Q4, budget multi-GPU | 2× RTX 3090 | ~$900 |
| Laptop, 70B+ models | M4 Max (128GB) | ~$3,500 |
| Laptop, 109B+ models | M4 Ultra (192GB) | ~$7,000 |
The Bottom Line
Most people need 24GB of VRAM. A used RTX 3090 at $450 covers 90% of local LLM use cases: 8B through 30B models at high quality, with room for longer context windows.

Don't buy a GPU just for TFLOPS. The RTX 4060 Ti 16GB is newer and far more power-efficient than the RTX 3090, but it's a much worse LLM GPU. VRAM and bandwidth are what matter.

Apple Silicon is different, not better or worse. If you want a silent laptop that can run 70B models, the M4 Max is unbeatable. If you want the fastest tokens per second on a desktop, the RTX 5090 wins.

Cloud beats buying if you use AI occasionally. At under 1 hour/day, renting on RunPod stays cheaper for the first year or two.

Not sure which models fit your GPU? Use our VRAM Calculator to get exact compatibility before you buy.
Related Guides
- How to Run a Local LLM on Your Laptop — setup guide for Mac and Windows
- Quantization Guide: Q4, Q5, Q8 Explained — how quantization affects VRAM and quality
- Local LLM vs ChatGPT: When to Use Each — full cost and quality comparison
- VRAM Calculator — find the largest model your GPU can run