⚡ Free Reference Guide

2026 Local LLM Cheat Sheet

Top models by use case, VRAM requirements at a glance, and copy-paste setup commands. Updated for 2026.

📬 Get weekly model updates too

New models drop every week. Get notified which ones your GPU can run, with benchmarks and setup guides.

No spam. Unsubscribe anytime. The cheat sheet below is free — no email required.

🎯 Top Models by Use Case (2026)

| Use Case | Best Model | VRAM | Speed | Notes |
| --- | --- | --- | --- | --- |
| Coding | DeepSeek-R1:14b | 10 GB | Fast | Best code quality per GB of VRAM |
| Coding (small) | Qwen2.5-Coder:7b | 6 GB | Very fast | Great on 8 GB cards |
| Chat | Llama 3.3:70b-Q4 | 40 GB | Medium | Near GPT-4 quality locally |
| Chat (small) | Phi-4:14b | 10 GB | Very fast | Punches above its weight class |
| Reasoning | DeepSeek-R1:32b | 20 GB | Slower | Best open-source reasoning model |
| Reasoning (fast) | QwQ:32b-Q4 | 20 GB | Medium | Strong math & logic |
| Vision | LLaVA:13b | 10 GB | Medium | Describe images, read charts |
| General (fast) | Gemma3:4b | 4 GB | Blazing | Runs on almost any GPU |
| Embeddings | nomic-embed-text | 2 GB | Blazing | Best for RAG pipelines |

🖥️ GPU VRAM Reference

| GPU | VRAM | Max Model Size | Recommended Models |
| --- | --- | --- | --- |
| GTX 1080 / 1080 Ti | 8 / 11 GB | 7B (Q4) | Llama 3.1:8b, Mistral:7b |
| RTX 3060 | 12 GB | 13B (Q4) | Phi-4:14b-Q3, DeepSeek-R1:7b |
| RTX 3080 | 10 GB | 7B (Q8) / 13B (Q4) | Qwen2.5:7b, Mistral:7b |
| RTX 3090 / 4090 | 24 GB | 33B (Q4) / 13B (Q8) | DeepSeek-R1:14b, Phi-4:14b |
| RTX 4060 Ti 16GB | 16 GB | 13B (Q8) | Phi-4:14b, Qwen2.5-Coder:14b |
| RTX 4080 Super | 16 GB | 13B (Q8) | DeepSeek-R1:14b, LLaVA:13b |
| RTX 5090 | 32 GB | 32B (Q8) | DeepSeek-R1:32b, QwQ:32b |
| 2× RTX 3090 | 48 GB | 70B (Q4) | Llama 3.3:70b-Q4 |
| Mac M3 Pro (36GB) | 36 GB unified | 34B (Q4) | DeepSeek-R1:32b, Qwen2.5:32b |

Rule of thumb: a Q4-quantized model needs roughly 0.6 GB of VRAM per billion parameters; Q8 needs roughly 1.1 GB per billion. Add 1–2 GB of overhead for the KV cache.
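The rule of thumb above can be wrapped in a few lines of Python for quick "will it fit?" checks. A minimal sketch; the `estimate_vram_gb` helper and its multipliers are just the rule of thumb, not measurements for any specific model:

```python
def estimate_vram_gb(params_b: float, quant: str = "q4", overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate in GB: ~0.6 GB (Q4) or ~1.1 GB (Q8) per
    billion parameters, plus KV-cache / runtime overhead."""
    gb_per_billion = {"q4": 0.6, "q8": 1.1}[quant.lower()]
    return params_b * gb_per_billion + overhead_gb

# A 14B model at Q4: 14 * 0.6 + 1.5 = 9.9 GB -> fits a 10-12 GB card
print(round(estimate_vram_gb(14, "q4"), 1))   # 9.9
# A 70B model at Q4: 70 * 0.6 + 1.5 = 43.5 GB -> needs ~48 GB (e.g. 2x RTX 3090)
print(round(estimate_vram_gb(70, "q4"), 1))   # 43.5
```

Treat the result as a floor, not a guarantee: long context windows can add several GB beyond the fixed overhead used here.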

⚡ Quick-Start Commands (Ollama)

# 1. Install Ollama (Linux; on macOS, download the app from ollama.com or use `brew install ollama`)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a model (download weights)
ollama pull llama3.3:70b-instruct-q4_K_M    # 40GB
ollama pull phi4:14b                         # 9GB
ollama pull deepseek-r1:14b                  # 10GB
ollama pull qwen2.5-coder:7b                 # 5GB
ollama pull gemma3:4b                        # 3GB — fastest
ollama pull nomic-embed-text                 # 2GB — embeddings

# 3. Run interactively
ollama run phi4:14b

# 4. Use via OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi4:14b",
    "messages": [{"role":"user","content":"Explain transformers in 2 sentences"}]
  }'

# 5. List downloaded models
ollama list

# 6. See GPU utilization during inference
watch -n1 nvidia-smi                          # Linux (on Windows: nvidia-smi -l 1)
sudo powermetrics --samplers gpu_power -i 1000 | grep GPU   # Mac
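The curl call in step 4 can also be scripted from Python's standard library alone. A minimal sketch, assuming Ollama is serving on its default port; the `build_request` and `chat` helpers are ours, not part of any SDK:

```python
import json
import urllib.request

# Default Ollama OpenAI-compatible endpoint; adjust host/port if needed.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the same OpenAI-style chat request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(model: str, prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires `ollama serve` running with the model pulled:
# print(chat("phi4:14b", "Explain transformers in 2 sentences"))
```

Because the endpoint speaks the OpenAI chat schema, the official `openai` client also works by pointing `base_url` at `http://localhost:11434/v1`.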

💡 Quick Tips

Quantization Guide

  • Q2_K: Smallest, noticeable quality loss
  • Q3_K_M: Good for RAM-constrained setups
  • Q4_K_M: Best quality/size balance ✓
  • Q5_K_M: Near-lossless, ~10% larger
  • Q8_0: Essentially lossless, 2× Q4 size
  • F16: Full precision, 2× Q8 — rarely needed
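Each level corresponds roughly to an average bits-per-weight, which is where the size ratios above come from. A rough size sketch; the bits-per-weight figures below are ballpark averages for llama.cpp-style GGUF quants, not exact spec values:

```python
# Approximate average bits per weight for common GGUF quant levels
# (ballpark figures, including quantization block metadata).
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0,
}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Estimated GGUF file size in GB: params * bits-per-weight / 8 bits per byte."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for q in ("Q2_K", "Q4_K_M", "Q8_0"):
    print(f"7B @ {q}: ~{file_size_gb(7, q):.1f} GB")
```

This is the download/disk size; resident VRAM is slightly higher once the KV cache is added.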

Performance Tips

  • Keep the context window small: reduces VRAM use
  • Use Q4_K_M quantization: best default
  • Close other GPU apps during inference: up to ~30% faster
  • Store models on an SSD: faster load times
  • Offload layers to the GPU, not the CPU: ~10× faster
  • Apple Silicon unified memory: excellent efficiency

Useful Ollama Env Vars

  • OLLAMA_NUM_PARALLEL: concurrent requests per model
  • OLLAMA_MAX_LOADED_MODELS: how many models stay in VRAM
  • OLLAMA_KEEP_ALIVE: how long a model stays loaded after use
  • OLLAMA_GPU_OVERHEAD: VRAM (in bytes) to reserve for the OS
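These are read from the environment of the `ollama serve` process. A minimal sketch for a manually launched server; the values below are illustrative, not Ollama's defaults:

```shell
# Illustrative values — tune for your hardware and workload.
export OLLAMA_NUM_PARALLEL=2        # serve 2 requests per model concurrently
export OLLAMA_MAX_LOADED_MODELS=1   # keep at most one model resident in VRAM
export OLLAMA_KEEP_ALIVE=30m        # unload a model 30 minutes after last use
ollama serve
```

If Ollama runs as a systemd service, set these via `systemctl edit ollama` (an `Environment=` override) instead of a shell export.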

Alternative Runtimes

  • LM Studio: GUI, beginner-friendly
  • llama.cpp: lowest level, max control
  • vLLM: production serving
  • Jan.ai: desktop UI + API
  • GPT4All: offline-first UI
