
How to Run Llama 4 Locally: Complete Setup Guide (2026)

Meta's Llama 4 Scout runs on a single consumer GPU. Here's how to set it up in 10 minutes using Ollama, llama.cpp, or vLLM — with real benchmark numbers and troubleshooting tips.


Llama 4 Scout is a Mixture-of-Experts model with 109B total parameters but only 17B active at inference time. That MoE architecture is why a 109B model fits on hardware that previously maxed out at 13B dense models. At Q4 quantization, it needs ~6GB of VRAM — an RTX 3060 handles it.

This guide covers three methods: Ollama (easiest), llama.cpp (most control), and vLLM (production serving). Pick the one that matches your use case.

VRAM Requirements

Use the VRAM calculator to check your specific GPU. General requirements for Llama 4 Scout:

| Quantization | VRAM Required | Quality vs FP16 | Best For |
|---|---|---|---|
| Q4_K_M | ~6–8 GB | ~97% | Most users, RTX 3060 / RX 6700 |
| Q5_K_M | ~9–10 GB | ~98.5% | RTX 3070 / RTX 4070 |
| Q6_K | ~11–13 GB | ~99% | RTX 3080 / RTX 4080 |
| Q8_0 | ~14–16 GB | ~99.8% | RTX 3090 / RTX 4090 |
| FP16 | ~24 GB | 100% | A100, H100, dual 3090 |
Llama 4 Maverick (400B total, 17B active) has the same active-parameter count but a much larger weight file — plan for roughly 3× the VRAM of Scout at the same quantization. Minimum viable setup: 8GB VRAM GPU + 16GB system RAM. For comfortable performance: 16GB VRAM.
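If you want to sanity-check figures like these yourself, weight memory is roughly parameter count times bits per weight, plus some overhead for the KV cache and runtime buffers. A back-of-envelope sketch (the 20% overhead factor and the bits-per-weight values are assumptions; real usage varies with context length and quant mix):

```python
def model_size_gb(params_billion: float, bits_per_weight: float,
                  overhead: float = 1.2) -> float:
    """Rough in-memory footprint: weights at bits/8 bytes each,
    plus ~20% for KV cache and buffers (assumed, varies with context)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# Example: a 7B dense model at Q4 (~4 bits/weight)
print(model_size_gb(7, 4))  # → 4.2
```

Going from FP16 (16 bits) to Q4 (~4 bits) is why the footprints in the table shrink by roughly 3.5–4× while quality stays near 97%.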

Method 1 — Ollama (Recommended)


Ollama is the fastest path to a working Llama 4 install. One command to pull, one to run.

Install Ollama

macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com/download

Pull and Run Llama 4 Scout

# Pull the model (downloads ~5GB for Q4 default)
ollama pull llama4:scout

# Start an interactive chat
ollama run llama4:scout

# Or pull Maverick (larger, more capable)
ollama pull llama4:maverick
ollama run llama4:maverick

The default llama4:scout tag pulls a Q4 build. If your GPU has headroom, pull an explicit quantization tag (such as llama4:scout-q8_0) for higher quality.

Use via API

Once the model is running, Ollama exposes an OpenAI-compatible API at localhost:11434:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama4:scout", "messages": [{"role": "user", "content": "Explain MoE in one paragraph"}]}'

Or use it with any OpenAI SDK by pointing the base URL at http://localhost:11434/v1.
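In Python without an SDK, the same request is a plain HTTP POST. A minimal standard-library sketch (the payload mirrors the curl example above; the network call only happens when you invoke chat, and it assumes Ollama is listening on the default port):

```python
import json
import urllib.request

# Same payload shape as the curl example above
payload = {
    "model": "llama4:scout",
    "messages": [{"role": "user", "content": "Explain MoE in one paragraph"}],
}

def chat(payload: dict, base_url: str = "http://localhost:11434/v1") -> str:
    """POST to Ollama's OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat(payload) returns the model's reply once the server is running
```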

Ollama Llama 4 Options

# List available Llama 4 variants
ollama list | grep llama4

# Pull a specific quantization
ollama pull llama4:scout-q8_0

# Check model info and VRAM usage
ollama show llama4:scout

Method 2 — llama.cpp (Maximum Control)

llama.cpp gives you direct control over quantization, GPU layer offloading, context size, and batching. Use it when you need to squeeze performance or run custom GGUF files.

Build llama.cpp

# Clone the repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA (NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Build with Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

# Build with ROCm (AMD GPUs)
cmake -B build -DGGML_HIPBLAS=ON
cmake --build build --config Release -j$(nproc)

Download Llama 4 GGUF

# Install huggingface-hub CLI
pip install huggingface-hub

# Download Q4_K_M (recommended starting point)
huggingface-cli download bartowski/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

# Or Q8_0 for near-lossless quality (needs 16GB+ VRAM)
huggingface-cli download bartowski/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "Llama-4-Scout-17B-16E-Instruct-Q8_0.gguf" \
  --local-dir ./models

Run with GPU Offloading

# -ngl 99 offloads all layers to GPU (fastest)
# -c 8192 sets context window to 8K tokens
# --threads 8 CPU threads for non-GPU layers
./build/bin/llama-server \
  -m ./models/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  --threads 8 \
  --port 8080

The server starts an OpenAI-compatible API at localhost:8080. Open localhost:8080 in your browser for the built-in chat UI.

Tuning -ngl (GPU layers):

| VRAM | Recommended -ngl | Notes |
|---|---|---|
| 6 GB | 20–30 | Partial GPU offload, slower |
| 8 GB | 40–50 | ~70% on GPU, decent speed |
| 12 GB | 70–80 | Most layers on GPU |
| 16 GB+ | 99 | Full GPU offload, maximum speed |

If you see OOM errors, reduce -ngl by 10 until stable.
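For scripting, the table collapses into a tiny helper. This is just the thresholds above encoded in Python (a convenience sketch, not hardware detection; the suggest_ngl name is my own):

```python
def suggest_ngl(vram_gb: float) -> int:
    """Starting value for llama.cpp's -ngl flag, per the table above.
    On OOM, reduce by 10 and retry until stable."""
    if vram_gb >= 16:
        return 99  # full GPU offload
    if vram_gb >= 12:
        return 75  # most layers on GPU
    if vram_gb >= 8:
        return 45  # ~70% on GPU
    return 25      # 6 GB class: partial offload

print(suggest_ngl(12))  # → 75
```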

Method 3 — vLLM / TGI (Production Serving)

Use vLLM or Text Generation Inference (TGI) when you're serving Llama 4 to multiple users or need maximum throughput.

vLLM

# Install vLLM
pip install vllm

# Serve Llama 4 Scout (requires HuggingFace access token)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 32768

# For quantized inference (AWQ, fits in 10GB VRAM)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization awq \
  --dtype auto
vLLM advantages: PagedAttention for high concurrency, continuous batching, an OpenAI-compatible API, and tensor parallelism across multiple GPUs.

Text Generation Inference (TGI)

# Run via Docker (easiest for TGI)
docker run --gpus all \
  -v $HOME/.cache/huggingface:/data \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --max-input-length 8192 \
  --max-total-tokens 16384
When to use vLLM vs TGI vs Ollama:

| Use Case | Best Tool |
|---|---|
| Personal use, quick setup | Ollama |
| Fine-grained control, custom GGUF | llama.cpp |
| Multi-user API server | vLLM |
| Production deployment, Docker | TGI |

Performance Benchmarks

Community-tested token generation speeds for Llama 4 Scout Q4_K_M (single-user, 512 token output):

| Hardware | Tokens/sec | Notes |
|---|---|---|
| RTX 4090 (24GB) | 48–56 tok/s | Full GPU offload, Q4_K_M |
| RTX 3090 (24GB) | 32–40 tok/s | Full GPU offload, Q4_K_M |
| RTX 4080 (16GB) | 38–44 tok/s | Full GPU offload, Q4_K_M |
| RTX 4070 Ti (12GB) | 28–34 tok/s | Full GPU offload, Q4_K_M |
| RTX 3080 (10GB) | 22–28 tok/s | Partial offload needed for Q8 |
| RTX 3060 (12GB) | 18–24 tok/s | Comfortable for Q4_K_M |
| M3 Ultra (96GB UMA) | 38–46 tok/s | Full model in unified memory |
| M2 Ultra (76GB UMA) | 30–36 tok/s | Full model in unified memory |
| M3 Max (48GB UMA) | 26–32 tok/s | Full model in unified memory |
For Q8_0 on an RTX 4090: expect ~28–34 tok/s (higher quality, lower speed).

These are generation speeds (not prefill/prompt processing). Prompt processing is faster; long contexts slow generation slightly.
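To turn the table's throughput numbers into expected wall-clock latency, divide the output length by tokens/sec. A trivial sketch (generation phase only; prompt prefill time is extra):

```python
def generation_seconds(output_tokens: int, tok_per_sec: float) -> float:
    """Wall-clock time for token generation only (prompt prefill excluded)."""
    return round(output_tokens / tok_per_sec, 1)

# A 512-token reply at 20 tok/s (low end of the RTX 3060 row):
print(generation_seconds(512, 20))  # → 25.6
```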

Llama 4 Scout vs Maverick

Meta released two Llama 4 models at launch. Here is when to use each:

| | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|
| Total Parameters | 109B (MoE) | 400B (MoE) |
| Active Parameters | 17B | 17B |
| VRAM (Q4_K_M) | ~6–8 GB | ~20–24 GB |
| Context Window | 10M tokens | 1M tokens |
| Best For | Single GPU, long context tasks | Multi-GPU, highest reasoning quality |
| Coding | Good | Better |
| Reasoning | Good | Best in class |
| Speed | Fast (~6–8 GB weight footprint) | Same active params, larger weight file |
Use Scout if: you have a single consumer GPU (8–16GB), need long context (>100K tokens), or want the fastest setup.

Use Maverick if: you have a high-end workstation (RTX 3090/4090 or multi-GPU), need best-in-class reasoning, or are running a production API server.

The 10M token context window on Scout is its killer feature — Maverick caps at 1M. If you need to feed entire codebases or document libraries in a single prompt, Scout wins.

Troubleshooting

Out of Memory (OOM) Errors

Symptoms: CUDA out of memory, RuntimeError: CUDA error, or llama.cpp crashes at startup. Fixes:
# 1. Reduce GPU layers (llama.cpp)
# Instead of -ngl 99, try -ngl 40 or -ngl 20
./build/bin/llama-server -m model.gguf -ngl 40

# 2. Use a smaller quantization:
# Q8_0 → Q5_K_M → Q4_K_M → Q4_0

# 3. Reduce context window
# Default is often 4096 or 8192 — try 2048
./build/bin/llama-server -m model.gguf -ngl 99 -c 2048

# 4. Close other GPU processes
nvidia-smi  # check what else is using VRAM

Slow Inference (< 5 tok/s)

Cause: Model is running on CPU instead of GPU, or too few layers offloaded. Fix:
# Check if CUDA build is active
./build/bin/llama-server --version
# Should show "CUDA" or "Metal" in build info

# Verify GPU is being used
nvidia-smi dmon -s u  # watch GPU utilization while generating

# Force all layers to GPU
./build/bin/llama-server -m model.gguf -ngl 999

If you built llama.cpp without -DGGML_CUDA=ON, it runs on CPU only. Rebuild with the correct flag.

Quantization Quality Loss

If outputs feel degraded (repetitive, incoherent, or factually wrong), you may be over-quantized for your use case. Rule of thumb: Q4_K_M is fine for general chat; step up to Q5_K_M or Q6_K for coding and factually sensitive tasks.

Check the quantization guide for a detailed quality comparison.

Ollama Model Not Found

# If "llama4:scout" is not found, check available tags
ollama list

# Pull with explicit tag
ollama pull llama4:scout-q4_K_M

# Or try the base llama4 tag
ollama pull llama4

Ollama model names can change between releases. Check ollama.com/library/llama4 for current available tags.

llama.cpp Build Errors

# Missing CUDA toolkit
nvcc --version  # should return CUDA version
# If missing: install from https://developer.nvidia.com/cuda-downloads

# CMake version too old (need 3.14+)
cmake --version
# Update via: pip install cmake --upgrade

# Metal errors on Apple Silicon:
# ensure Xcode command line tools are installed
xcode-select --install

Next Steps

Once Llama 4 is running locally, point any OpenAI-compatible client at your local endpoint (localhost:11434 for Ollama, localhost:8080 for llama.cpp) and start building.
