Meta's Llama 4 Scout runs on a single consumer GPU. Here's how to set it up in 10 minutes.
Llama 4 Scout is a Mixture-of-Experts model with 109B total parameters but only 17B active at inference time. That MoE architecture is why a 109B model fits on hardware that previously maxed out at 13B dense models. At Q4 quantization, it needs ~6GB of VRAM — an RTX 3060 handles it.
This guide covers three methods: Ollama (easiest), llama.cpp (most control), and vLLM (production serving). Pick the one that matches your use case.
VRAM Requirements
Use the VRAM calculator to check your specific GPU. General requirements for Llama 4 Scout:
| Quantization | VRAM Required | Quality vs FP16 | Best For |
|---|---|---|---|
| Q4_K_M | ~6–8 GB | ~97% | Most users, RTX 3060 / RX 6700 |
| Q5_K_M | ~9–10 GB | ~98.5% | RTX 3070 / RTX 4070 |
| Q6_K | ~11–13 GB | ~99% | RTX 3080 / RTX 4080 |
| Q8_0 | ~14–16 GB | ~99.8% | RTX 3090 / RTX 4090 |
| FP16 | ~24 GB | 100% | A100, H100, dual 3090 |
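As a rough sanity check alongside the table, weight memory scales with parameter count times bits per weight. Here is a minimal back-of-envelope sketch (the function name and the ~4.5 effective bits for Q4_K_M are illustrative assumptions; it ignores KV cache and runtime buffers, which typically add 1–3 GB depending on context length):

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate: params * bits / 8, in GB.

    Ignores KV cache, activations, and runtime buffers, which add
    more VRAM on top depending on context length and batch size.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Example: a 7B dense model at ~4.5 effective bits per weight (Q4_K_M-class)
print(round(estimate_weight_vram_gb(7, 4.5), 1))  # → 3.9
```

Treat the table as the reference and the formula as a cross-check: GGUF quantization schemes mix precisions internally, so effective bits per weight vary by scheme.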
Method 1 — Ollama (Recommended)
Ollama is the fastest path to a working Llama 4 install. One command to pull, one to run.
Install Ollama
macOS / Linux: curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com/download
Pull and Run Llama 4 Scout
# Pull the model (downloads ~5GB for Q4 default)
ollama pull llama4:scout

# Start an interactive chat
ollama run llama4:scout

# Or pull Maverick (larger, more capable)
ollama pull llama4:maverick
ollama run llama4:maverick

Ollama automatically selects the best quantization for your hardware. It detects your GPU VRAM and downloads the highest quality variant that fits.
Use via API
Once the model is running, Ollama exposes an OpenAI-compatible API at localhost:11434:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama4:scout", "messages": [{"role": "user", "content": "Explain MoE in one paragraph"}]}'
Or use it with any OpenAI SDK by pointing the base URL at http://localhost:11434/v1.
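For scripting, the same endpoint can be called from Python with only the standard library. A minimal sketch (the `chat_payload` and `ask` helper names are illustrative, not part of Ollama; it assumes the server is running locally):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(prompt: str, model: str = "llama4:scout") -> str:
    """POST the payload to the local Ollama server and return the reply text."""
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With `ollama run llama4:scout` active, `ask("Explain MoE in one paragraph")` returns the assistant text.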
Ollama Llama 4 Options
# List available Llama 4 variants
ollama list | grep llama4

# Pull a specific quantization
ollama pull llama4:scout-q8_0

# Check model info and VRAM usage
ollama show llama4:scout
Method 2 — llama.cpp (Maximum Control)
llama.cpp gives you direct control over quantization, GPU layer offloading, context size, and batching. Use it when you need to squeeze performance or run custom GGUF files.
Build llama.cpp
# Clone the repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA (NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Build with Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

# Build with ROCm (AMD GPUs)
cmake -B build -DGGML_HIPBLAS=ON
cmake --build build --config Release -j$(nproc)
Download Llama 4 GGUF
# Install huggingface-hub CLI
pip install huggingface-hub

# Download Q4_K_M (recommended starting point)
huggingface-cli download bartowski/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

# Or Q8_0 for near-lossless quality (needs 16GB+ VRAM)
huggingface-cli download bartowski/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "Llama-4-Scout-17B-16E-Instruct-Q8_0.gguf" \
  --local-dir ./models
Run with GPU Offloading
# -ngl 99 offloads all layers to GPU (fastest)
# -c 8192 sets context window to 8K tokens
# --threads 8 CPU threads for non-GPU layers
./build/bin/llama-server \
-m ./models/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
-ngl 99 \
-c 8192 \
--threads 8 \
--port 8080
The server starts an OpenAI-compatible API at localhost:8080. Open localhost:8080 in your browser for the built-in chat UI.
Tuning -ngl (GPU layers):
| VRAM | Recommended -ngl | Notes |
|---|---|---|
| 6 GB | 20–30 | Partial GPU offload, slower |
| 8 GB | 40–50 | ~70% on GPU, decent speed |
| 12 GB | 70–80 | Most layers on GPU |
| 16 GB+ | 99 | Full GPU offload, maximum speed |
If you hit out-of-memory errors, reduce -ngl by 10 until generation is stable.
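The table above can be approximated with a quick heuristic: layers that fit equals free VRAM divided by per-layer size. A sketch, assuming layers are roughly equal in size (the function and its 1.5 GB overhead default are illustrative assumptions, not llama.cpp behavior):

```python
def suggest_ngl(vram_gb: float, model_file_gb: float, n_layers: int,
                overhead_gb: float = 1.5) -> int:
    """Rough starting point for -ngl: how many layers fit in free VRAM.

    Assumes layers are roughly equal in size and reserves overhead_gb
    for KV cache and CUDA buffers. Tune by ~10 from this starting point.
    """
    per_layer_gb = model_file_gb / n_layers
    fit = int((vram_gb - overhead_gb) / per_layer_gb)
    return max(0, min(fit, n_layers))

# Hypothetical example: 8 GB card, 7 GB GGUF file, 48 transformer layers
print(suggest_ngl(8, 7, 48))  # → 44
```

This lands in the same ballpark as the table's 8 GB row; real numbers depend on the specific GGUF and context size.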
Method 3 — vLLM / TGI (Production Serving)
Use vLLM or Text Generation Inference (TGI) when you're serving Llama 4 to multiple users or need maximum throughput.
vLLM
# Install vLLM
pip install vllm
# Serve Llama 4 Scout (requires HuggingFace access token)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 32768
# For quantized inference (AWQ, fits in 10GB VRAM)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--quantization awq \
--dtype auto
vLLM advantages: PagedAttention for high concurrency, continuous batching, OpenAI-compatible API, supports tensor parallelism across multiple GPUs.
Text Generation Inference (TGI)
# Run via Docker (easiest for TGI)
docker run --gpus all \
-v $HOME/.cache/huggingface:/data \
-p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-4-Scout-17B-16E-Instruct \
--max-input-length 8192 \
--max-total-tokens 16384
When to use vLLM vs TGI vs Ollama:
| Use Case | Best Tool |
|---|---|
| Personal use, quick setup | Ollama |
| Fine-grained control, custom GGUF | llama.cpp |
| Multi-user API server | vLLM |
| Production deployment, Docker | TGI |
Performance Benchmarks
Community-tested token generation speeds for Llama 4 Scout Q4_K_M (single-user, 512 token output):
| Hardware | Tokens/sec | Notes |
|---|---|---|
| RTX 4090 (24GB) | 48–56 tok/s | Full GPU offload, Q4_K_M |
| RTX 3090 (24GB) | 32–40 tok/s | Full GPU offload, Q4_K_M |
| RTX 4080 (16GB) | 38–44 tok/s | Full GPU offload, Q4_K_M |
| RTX 4070 Ti (12GB) | 28–34 tok/s | Full GPU offload, Q4_K_M |
| RTX 3080 (10GB) | 22–28 tok/s | Partial offload needed for Q8 |
| RTX 3060 (12GB) | 18–24 tok/s | Comfortable for Q4_K_M |
| M3 Ultra (96GB UMA) | 38–46 tok/s | Full model in unified memory |
| M2 Ultra (76GB UMA) | 30–36 tok/s | Full model in unified memory |
| M3 Max (48GB UMA) | 26–32 tok/s | Full model in unified memory |
These are generation speeds (not prefill/prompt processing). Prompt processing is faster; long contexts slow generation slightly.
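To benchmark your own setup the same way, time a full generation and divide token count by elapsed seconds. A small harness sketch (`measure_tps` is an illustrative helper; wire `generate` to whatever streaming client you use):

```python
import time
from typing import Callable, Iterable, Tuple

def measure_tps(generate: Callable[[], Iterable[str]]) -> Tuple[int, float]:
    """Drain a token stream and return (token_count, tokens_per_second).

    `generate` is any callable yielding tokens, e.g. a streaming client
    for Ollama or llama-server (wiring not shown here).
    """
    start = time.perf_counter()
    n = sum(1 for _ in generate())
    elapsed = time.perf_counter() - start
    return n, n / elapsed if elapsed > 0 else 0.0
```

Run it with a 512-token request to match the table's conditions; short outputs overweight prefill time.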
Llama 4 Scout vs Maverick
Meta released two Llama 4 models at launch. Here is when to use each:
| | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|
| Total Parameters | 109B (MoE) | 400B (MoE) |
| Active Parameters | 17B | 17B |
| VRAM (Q4_K_M) | ~6–8 GB | ~20–24 GB |
| Context Window | 10M tokens | 1M tokens |
| Best For | Single GPU, long context tasks | Multi-GPU, highest reasoning quality |
| Coding | Good | Better |
| Reasoning | Good | Best in class |
| Speed | Fast (17B active) | Same active params, larger weight file |
The 10M token context window on Scout is its killer feature — Maverick caps at 1M. If you need to feed entire codebases or document libraries in a single prompt, Scout wins.
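To sanity-check whether a codebase fits in a context window before feeding it in, a crude estimate is one token per ~4 characters, a common rule of thumb for English and code. Real tokenizers vary, so treat this sketch as order-of-magnitude only (both helper names are illustrative):

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough average; real tokenizers vary by language and content

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return len(text) // CHARS_PER_TOKEN

def codebase_tokens(root: str, exts=(".py", ".js", ".md")) -> int:
    """Approximate total tokens of source files under root."""
    return sum(
        estimate_tokens(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in exts
    )
```

If the estimate lands near the model's limit, count properly with the model's actual tokenizer before committing to a single-prompt workflow.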
Troubleshooting
Out of Memory (OOM) Errors
Symptoms: CUDA out of memory, RuntimeError: CUDA error, or llama.cpp crashes at startup.
Fixes:
# 1. Reduce GPU layers (llama.cpp)
# Instead of -ngl 99, try -ngl 40 or -ngl 20
./build/bin/llama-server -m model.gguf -ngl 40

# 2. Use a smaller quantization
# Q8_0 → Q5_K_M → Q4_K_M → Q4_0

# 3. Reduce context window
# Default is often 4096 or 8192 — try 2048
./build/bin/llama-server -m model.gguf -ngl 99 -c 2048

# 4. Close other GPU processes
nvidia-smi  # check what else is using VRAM
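The VRAM check in step 4 can be scripted: nvidia-smi's CSV query mode is machine-readable. A small sketch (helper names are illustrative; NVIDIA GPUs only):

```python
import subprocess

def parse_smi_csv(out: str):
    """Parse 'used, total' CSV lines (MiB values) into int tuples."""
    return [
        tuple(int(x) for x in line.split(","))
        for line in out.strip().splitlines()
    ]

def gpu_memory_mb():
    """Return (used_mib, total_mib) per GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```

Subtract used from total to see how much headroom you actually have before picking a quantization or `-ngl` value.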
Slow Inference (< 5 tok/s)
Cause: The model is running on CPU instead of GPU, or too few layers are offloaded.
Fix:
# Check if CUDA build is active
./build/bin/llama-server --version
# Should show "CUDA" or "Metal" in build info

# Verify GPU is being used
nvidia-smi dmon -s u  # watch GPU utilization while generating

# Force all layers to GPU
./build/bin/llama-server -m model.gguf -ngl 999
If you built llama.cpp without -DGGML_CUDA=ON, it runs on CPU only. Rebuild with the correct flag.
Quantization Quality Loss
If outputs feel degraded (repetitive, incoherent, or wrong factually), you may be over-quantized for your use case. Rule of thumb:
- Creative writing, chat: Q4_K_M is fine
- Coding, logic, math: Q5_K_M or higher
- Production/accuracy-critical: Q8_0 or FP16
Ollama Model Not Found
# If "llama4:scout" is not found, check available tags
ollama list
# Pull with explicit tag
ollama pull llama4:scout-q4_K_M

# Or try the base llama4 tag
ollama pull llama4

Ollama model names can change between releases. Check ollama.com/library/llama4 for current available tags.
llama.cpp Build Errors
# Missing CUDA toolkit
nvcc --version # should return CUDA version
# If missing: install from https://developer.nvidia.com/cuda-downloads

# CMake version too old (need 3.14+)
cmake --version
# Update via: pip install cmake --upgrade

# Metal errors on Apple Silicon
# Ensure Xcode command line tools are installed
xcode-select --install
Next Steps
Once Llama 4 is running locally:
- Check GPU compatibility first: VRAM Calculator →
- Understand quantization tradeoffs: Quantization Guide (Q4/Q5/Q8/FP16)
- Pick the right GPU: Best GPUs for Local LLMs in 2026
- Master Ollama: Complete Ollama Guide 2026
- Compare to cloud AI: Local LLM vs ChatGPT — Real Cost Analysis