⚡ Free Reference Guide

2026 Local LLM Cheat Sheet

Top models by use case, VRAM requirements at a glance, and copy-paste setup commands. Updated for 2026.

📬 Get weekly model updates too

New models drop every week. Get notified which ones your GPU can run, with benchmarks and setup guides.

No spam. Unsubscribe anytime. The cheat sheet below is free — no email required.

🎯 Top Models by Use Case (2026)

| Use Case | Best Model | VRAM | Speed | Notes |
| --- | --- | --- | --- | --- |
| Coding | DeepSeek-R1:14b | 10 GB | Fast | Best code quality per GB of VRAM |
| Coding (small) | Qwen2.5-Coder:7b | 6 GB | Very fast | Great on 8 GB cards |
| Chat | Llama 3.3:70b-Q4 | 40 GB | Medium | Near GPT-4 quality locally |
| Chat (small) | Phi-4:14b | 10 GB | Very fast | Punches above its weight class |
| Reasoning | DeepSeek-R1:32b | 20 GB | Slower | Best open-source reasoning model |
| Reasoning (fast) | QwQ:32b-Q4 | 20 GB | Medium | Strong math & logic |
| Vision | LLaVA:13b | 10 GB | Medium | Describe images, read charts |
| General (fast) | Gemma3:4b | 4 GB | Blazing | Runs on almost any GPU |
| Embeddings | nomic-embed-text | 2 GB | Blazing | Best for RAG pipelines |

🖥️ GPU VRAM Reference

| GPU | VRAM | Max Model Size | Recommended Models |
| --- | --- | --- | --- |
| GTX 1080 / 1080 Ti | 8 / 11 GB | 7B (Q4) | Llama 3.1:8b, Mistral:7b |
| RTX 3060 | 12 GB | 13B (Q4) | Phi-4:14b-Q3, DeepSeek-R1:7b |
| RTX 3080 | 10 GB | 7B (Q8) / 13B (Q4) | Qwen2.5:7b, Mistral:7b |
| RTX 3090 / 4090 | 24 GB | 33B (Q4) / 13B (Q8) | DeepSeek-R1:14b, Phi-4:14b |
| RTX 4060 Ti 16GB | 16 GB | 13B (Q8) | Phi-4:14b, Qwen2.5-Coder:14b |
| RTX 4080 Super | 16 GB | 13B (Q8) | DeepSeek-R1:14b, LLaVA:13b |
| RTX 5090 | 32 GB | 32B (Q8) | DeepSeek-R1:32b, QwQ:32b |
| 2× RTX 3090 | 48 GB | 70B (Q4) | Llama 3.3:70b-Q4 |
| Mac M3 Pro (36GB) | 36 GB unified | 34B (Q4) | DeepSeek-R1:32b, Qwen2.5:32b |

Rule of thumb: a Q4-quantized model needs roughly 0.6 GB of VRAM per billion parameters; Q8 needs roughly 1.1 GB per billion. Add 1–2 GB of overhead for the KV cache.
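The rule of thumb above can be wrapped in a few lines of Python for quick "will it fit?" checks. A minimal sketch; the `estimate_vram_gb` helper and its multipliers are just the rule of thumb, not measurements for any specific model:

```python
def estimate_vram_gb(params_b: float, quant: str = "q4", overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate in GB: ~0.6 GB (Q4) or ~1.1 GB (Q8) per
    billion parameters, plus KV-cache / runtime overhead."""
    gb_per_billion = {"q4": 0.6, "q8": 1.1}[quant.lower()]
    return params_b * gb_per_billion + overhead_gb

# A 14B model at Q4: 14 * 0.6 + 1.5 = 9.9 GB -> fits a 10-12 GB card
print(round(estimate_vram_gb(14, "q4"), 1))   # 9.9
# A 70B model at Q4: 70 * 0.6 + 1.5 = 43.5 GB -> needs ~48 GB (e.g. 2x RTX 3090)
print(round(estimate_vram_gb(70, "q4"), 1))   # 43.5
```

Treat the result as a floor, not a guarantee: long context windows can add several GB beyond the fixed overhead used here.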

⚡ Quick-Start Commands (Ollama)

# 1. Install Ollama (Linux; on macOS, download the app from ollama.com or use `brew install ollama`)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a model (download weights)
ollama pull llama3.3:70b-instruct-q4_K_M    # 40GB
ollama pull phi4:14b                         # 9GB
ollama pull deepseek-r1:14b                  # 10GB
ollama pull qwen2.5-coder:7b                 # 5GB
ollama pull gemma3:4b                        # 3GB — fastest
ollama pull nomic-embed-text                 # 2GB — embeddings

# 3. Run interactively
ollama run phi4:14b

# 4. Use via OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi4:14b",
    "messages": [{"role":"user","content":"Explain transformers in 2 sentences"}]
  }'

# 5. List downloaded models
ollama list

# 6. See GPU utilization during inference
watch -n1 nvidia-smi                          # Linux (on Windows: nvidia-smi -l 1)
sudo powermetrics --samplers gpu_power -i 1000 | grep GPU   # Mac
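The curl call in step 4 can also be scripted from Python's standard library alone. A minimal sketch, assuming Ollama is serving on its default port; the `build_request` and `chat` helpers are ours, not part of any SDK:

```python
import json
import urllib.request

# Default Ollama OpenAI-compatible endpoint; adjust host/port if needed.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the same OpenAI-style chat request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(model: str, prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires `ollama serve` running with the model pulled:
# print(chat("phi4:14b", "Explain transformers in 2 sentences"))
```

Because the endpoint speaks the OpenAI chat schema, the official `openai` client also works by pointing `base_url` at `http://localhost:11434/v1`.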

💡 Quick Tips

Quantization Guide

  • Q2_K: Smallest, noticeable quality loss
  • Q3_K_M: Good for RAM-constrained setups
  • Q4_K_M: Best quality/size balance ✓
  • Q5_K_M: Near-lossless, ~10% larger
  • Q8_0: Essentially lossless, 2× Q4 size
  • F16: Full precision, 2× Q8 — rarely needed
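Each level corresponds roughly to an average bits-per-weight, which is where the size ratios above come from. A rough size sketch; the bits-per-weight figures below are ballpark averages for llama.cpp-style GGUF quants, not exact spec values:

```python
# Approximate average bits per weight for common GGUF quant levels
# (ballpark figures, including quantization block metadata).
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0,
}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Estimated GGUF file size in GB: params * bits-per-weight / 8 bits per byte."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for q in ("Q2_K", "Q4_K_M", "Q8_0"):
    print(f"7B @ {q}: ~{file_size_gb(7, q):.1f} GB")
```

This is the download/disk size; resident VRAM is slightly higher once the KV cache is added.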

Performance Tips

  • Keep the context window small: reduces VRAM use
  • Use Q4_K_M quantization: best default
  • Close other GPU apps during inference: up to ~30% faster
  • Store models on an SSD: faster load times
  • Offload layers to the GPU, not the CPU: ~10× faster
  • Apple Silicon unified memory: excellent efficiency

Useful Ollama Env Vars

  • OLLAMA_NUM_PARALLEL: concurrent requests per model
  • OLLAMA_MAX_LOADED_MODELS: how many models stay in VRAM
  • OLLAMA_KEEP_ALIVE: how long a model stays loaded after use
  • OLLAMA_GPU_OVERHEAD: VRAM (in bytes) to reserve for the OS
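These are read from the environment of the `ollama serve` process. A minimal sketch for a manually launched server; the values below are illustrative, not Ollama's defaults:

```shell
# Illustrative values — tune for your hardware and workload.
export OLLAMA_NUM_PARALLEL=2        # serve 2 requests per model concurrently
export OLLAMA_MAX_LOADED_MODELS=1   # keep at most one model resident in VRAM
export OLLAMA_KEEP_ALIVE=30m        # unload a model 30 minutes after last use
ollama serve
```

If Ollama runs as a systemd service, set these via `systemctl edit ollama` (an `Environment=` override) instead of a shell export.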

Alternative Runtimes

  • LM Studio: GUI, beginner-friendly
  • llama.cpp: lowest level, max control
  • vLLM: production serving
  • Jan.ai: desktop UI + API
  • GPT4All: offline-first UI
