Top models by use case, VRAM requirements at a glance, and copy-paste setup commands. Updated for 2026.
New models drop every week. Get notified which ones your GPU can run, with benchmarks and setup guides.
No spam. Unsubscribe anytime. The cheat sheet below is free, no email required.

| Use Case | Best Model | VRAM | Speed | Notes |
|---|---|---|---|---|
| Coding | DeepSeek-R1:14b | 10 GB | Fast | Best code quality per GB VRAM |
| Coding (small) | Qwen2.5-Coder:7b | 6 GB | Very fast | Great on 8GB cards |
| Chat | Llama 3.3:70b-Q4 | 40 GB | Medium | Near GPT-4 quality locally |
| Chat (small) | Phi-4:14b | 10 GB | Very fast | Punches above its weight class |
| Reasoning | DeepSeek-R1:32b | 20 GB | Slower | Best open-source reasoning model |
| Reasoning (fast) | QwQ:32b-Q4 | 20 GB | Medium | Strong math & logic |
| Vision | LLaVA:13b | 10 GB | Medium | Describe images, read charts |
| General (fast) | Gemma3:4b | 4 GB | Blazing | Runs on almost any GPU |
| Embeddings | nomic-embed-text | 2 GB | Blazing | Best for RAG pipelines |
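For the RAG row above: Ollama serves embeddings over plain HTTP as well. A minimal sketch (assumes a running local server with `nomic-embed-text` already pulled; the canned `sample` below only illustrates the response shape, it is not real model output):

```shell
# Real call (needs a running Ollama server):
#   curl -s http://localhost:11434/api/embeddings \
#     -d '{"model": "nomic-embed-text", "prompt": "What is VRAM?"}'
# It returns JSON like the abbreviated sample below
# (nomic-embed-text produces 768-dimensional vectors).
sample='{"embedding": [0.01, -0.23, 0.50]}'
echo "$sample" | jq '.embedding | length'   # vector dimensionality of the sample
```

Store the returned vectors in your vector database of choice and you have the retrieval half of a RAG pipeline.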

| GPU | VRAM | Max Model Size | Recommended Models |
|---|---|---|---|
| GTX 1080 / 1080 Ti | 8 / 11 GB | 7B (Q4) | Llama 3.1:8b, Mistral:7b |
| RTX 3060 | 12 GB | 13B (Q4) | Phi-4:14b-Q3, DeepSeek-R1:7b |
| RTX 3080 | 10 GB | 7B (Q8) / 13B (Q4) | Qwen2.5:7b, Mistral:7b |
| RTX 3090 / 4090 | 24 GB | 33B (Q4) / 13B (Q8) | DeepSeek-R1:14b, Phi-4:14b |
| RTX 4060 Ti 16GB | 16 GB | 13B (Q8) | Phi-4:14b, Qwen2.5-Coder:14b |
| RTX 4080 Super | 16 GB | 13B (Q8) | DeepSeek-R1:14b, LLaVA:13b |
| RTX 5090 | 32 GB | 32B (Q4–Q6) | DeepSeek-R1:32b, QwQ:32b-Q4 |
| 2× RTX 3090 | 48 GB | 70B (Q4) | Llama 3.3:70b-Q4 |
| Mac M3 Pro (36GB) | 36 GB unified | 34B (Q4) | DeepSeek-R1:32b; Llama 3.3:70b-Q3 (tight) |

Rule of thumb: Q4 quantization needs ≈ 0.6 GB of VRAM per billion parameters; Q8 needs ≈ 1.1 GB per billion. Add 1–2 GB of overhead for the KV cache.
```shell
# 1. Install Ollama (Linux/Mac)
curl -fsSL https://ollama.ai/install.sh | sh

# 2. Pull a model (download weights)
ollama pull llama3.3:70b-instruct-q4_K_M   # 40GB
ollama pull phi4:14b                       # 9GB
ollama pull deepseek-r1:14b                # 10GB
ollama pull qwen2.5-coder:7b               # 5GB
ollama pull gemma3:4b                      # 3GB — fastest
ollama pull nomic-embed-text               # 2GB — embeddings

# 3. Run interactively
ollama run phi4:14b

# 4. Use via OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi4:14b",
    "messages": [{"role": "user", "content": "Explain transformers in 2 sentences"}]
  }'

# 5. List downloaded models
ollama list

# 6. See GPU utilization during inference
watch -n1 nvidia-smi                                          # Linux
sudo powermetrics --samplers gpu_power -i 1000 | grep -i gpu  # Mac
```
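The API returns OpenAI-style JSON, so you can pipe the curl output through `jq` to get just the reply text (assumes `jq` is installed; the canned `response` below stands in for a live server):

```shell
# Abbreviated shape of a /v1/chat/completions response
response='{"choices": [{"message": {"role": "assistant", "content": "Hi there!"}}]}'
# The same filter works on real output: curl ... | jq -r '.choices[0].message.content'
echo "$response" | jq -r '.choices[0].message.content'
# → Hi there!
```

This makes it trivial to wire local models into shell scripts without any SDK.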