Meta's Llama 4 Scout runs on a single consumer GPU. Here's how to set it up in 10 minutes.
Llama 4 Scout is a Mixture-of-Experts model with 109B total parameters but only 17B active at inference time. That MoE architecture is why a 109B model fits on hardware that previously maxed out at 13B dense models. At Q4 quantization, it needs ~6GB of VRAM — an RTX 3060 handles it.
This guide covers three methods: Ollama (easiest), llama.cpp (most control), and vLLM (production serving). Pick the one that matches your use case.
VRAM Requirements
Use the VRAM calculator to check your specific GPU. General requirements for Llama 4 Scout:
| Quantization | VRAM Required | Quality vs FP16 | Best For |
|---|---|---|---|
| Q4_K_M | ~6–8 GB | ~97% | Most users, RTX 3060 / RX 6700 |
| Q5_K_M | ~9–10 GB | ~98.5% | RTX 3070 / RTX 4070 |
| Q6_K | ~11–13 GB | ~99% | RTX 3080 / RTX 4080 |
| Q8_0 | ~14–16 GB | ~99.8% | RTX 3090 / RTX 4090 |
| FP16 | ~24 GB | 100% | A100, H100, dual 3090 |
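As a rough sanity check alongside the table, weight memory scales with parameter count times bits per weight. Here is a minimal back-of-envelope sketch (the function name and the ~4.5 effective bits for Q4_K_M are illustrative assumptions; it ignores KV cache and runtime buffers, which typically add 1–3 GB depending on context length):

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate: params * bits / 8, in GB.

    Ignores KV cache, activations, and runtime buffers, which add
    more VRAM on top depending on context length and batch size.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Example: a 7B dense model at ~4.5 effective bits per weight (Q4_K_M-class)
print(round(estimate_weight_vram_gb(7, 4.5), 1))  # → 3.9
```

Treat the table as the reference and the formula as a cross-check: GGUF quantization schemes mix precisions internally, so effective bits per weight vary by scheme.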
Method 1 — Ollama (Recommended)
Ollama is the fastest path to a working Llama 4 install. One command to pull, one to run.
Install Ollama
macOS / Linux: curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com/download
Pull and Run Llama 4 Scout
# Pull the model (downloads ~5GB for Q4 default)
ollama pull llama4:scout

# Start an interactive chat
ollama run llama4:scout

# Or pull Maverick (larger, more capable)
ollama pull llama4:maverick
ollama run llama4:maverick

Ollama automatically selects the best quantization for your hardware. It detects your GPU VRAM and downloads the highest quality variant that fits.
Use via API
Once the model is running, Ollama exposes an OpenAI-compatible API at localhost:11434:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama4:scout", "messages": [{"role": "user", "content": "Explain MoE in one paragraph"}]}'
Or use it with any OpenAI SDK by pointing the base URL at http://localhost:11434/v1.
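For scripting, the same endpoint can be called from Python with only the standard library. A minimal sketch (the `chat_payload` and `ask` helper names are illustrative, not part of Ollama; it assumes the server is running locally):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(prompt: str, model: str = "llama4:scout") -> str:
    """POST the payload to the local Ollama server and return the reply text."""
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With `ollama run llama4:scout` active, `ask("Explain MoE in one paragraph")` returns the assistant text.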
Ollama Llama 4 Options
# List available Llama 4 variants
ollama list | grep llama4

# Pull a specific quantization
ollama pull llama4:scout-q8_0

# Check model info and VRAM usage
ollama show llama4:scout
Method 2 — llama.cpp (Maximum Control)
llama.cpp gives you direct control over quantization, GPU layer offloading, context size, and batching. Use it when you need to squeeze performance or run custom GGUF files.
Build llama.cpp
# Clone the repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA (NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Build with Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

# Build with ROCm (AMD GPUs)
cmake -B build -DGGML_HIPBLAS=ON
cmake --build build --config Release -j$(nproc)
Download Llama 4 GGUF
# Install huggingface-hub CLI
pip install huggingface-hub

# Download Q4_K_M (recommended starting point)
huggingface-cli download bartowski/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

# Or Q8_0 for near-lossless quality (needs 16GB+ VRAM)
huggingface-cli download bartowski/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "Llama-4-Scout-17B-16E-Instruct-Q8_0.gguf" \
  --local-dir ./models
Run with GPU Offloading
# -ngl 99 offloads all layers to GPU (fastest)
# -c 8192 sets context window to 8K tokens
# --threads 8 CPU threads for non-GPU layers
./build/bin/llama-server \
-m ./models/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
-ngl 99 \
-c 8192 \
--threads 8 \
--port 8080
The server starts an OpenAI-compatible API at localhost:8080. Open localhost:8080 in your browser for the built-in chat UI.
Tuning -ngl (GPU layers):
| VRAM | Recommended -ngl | Notes |
|---|---|---|
| 6 GB | 20–30 | Partial GPU offload, slower |
| 8 GB | 40–50 | ~70% on GPU, decent speed |
| 12 GB | 70–80 | Most layers on GPU |
| 16 GB+ | 99 | Full GPU offload, maximum speed |
If you hit out-of-memory errors, reduce -ngl by 10 until generation is stable.
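The table above can be approximated with a quick heuristic: layers that fit equals free VRAM divided by per-layer size. A sketch, assuming layers are roughly equal in size (the function and its 1.5 GB overhead default are illustrative assumptions, not llama.cpp behavior):

```python
def suggest_ngl(vram_gb: float, model_file_gb: float, n_layers: int,
                overhead_gb: float = 1.5) -> int:
    """Rough starting point for -ngl: how many layers fit in free VRAM.

    Assumes layers are roughly equal in size and reserves overhead_gb
    for KV cache and CUDA buffers. Tune by ~10 from this starting point.
    """
    per_layer_gb = model_file_gb / n_layers
    fit = int((vram_gb - overhead_gb) / per_layer_gb)
    return max(0, min(fit, n_layers))

# Hypothetical example: 8 GB card, 7 GB GGUF file, 48 transformer layers
print(suggest_ngl(8, 7, 48))  # → 44
```

This lands in the same ballpark as the table's 8 GB row; real numbers depend on the specific GGUF and context size.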
Method 3 — vLLM / TGI (Production Serving)
Use vLLM or Text Generation Inference (TGI) when you're serving Llama 4 to multiple users or need maximum throughput.
vLLM
# Install vLLM
pip install vllm
# Serve Llama 4 Scout (requires HuggingFace access token)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 32768
# For quantized inference (AWQ, fits in 10GB VRAM)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--quantization awq \
--dtype auto
vLLM advantages: PagedAttention for high concurrency, continuous batching, OpenAI-compatible API, supports tensor parallelism across multiple GPUs.
Text Generation Inference (TGI)
# Run via Docker (easiest for TGI)
docker run --gpus all \
-v $HOME/.cache/huggingface:/data \
-p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-4-Scout-17B-16E-Instruct \
--max-input-length 8192 \
--max-total-tokens 16384
When to use vLLM vs TGI vs Ollama:
| Use Case | Best Tool |
|---|---|
| Personal use, quick setup | Ollama |
| Fine-grained control, custom GGUF | llama.cpp |
| Multi-user API server | vLLM |
| Production deployment, Docker | TGI |
Performance Benchmarks
Community-tested token generation speeds for Llama 4 Scout Q4_K_M (single-user, 512 token output):
| Hardware | Tokens/sec | Notes |
|---|---|---|
| RTX 4090 (24GB) | 48–56 tok/s | Full GPU offload, Q4_K_M |
| RTX 3090 (24GB) | 32–40 tok/s | Full GPU offload, Q4_K_M |
| RTX 4080 (16GB) | 38–44 tok/s | Full GPU offload, Q4_K_M |
| RTX 4070 Ti (12GB) | 28–34 tok/s | Full GPU offload, Q4_K_M |
| RTX 3080 (10GB) | 22–28 tok/s | Partial offload needed for Q8 |
| RTX 3060 (12GB) | 18–24 tok/s | Comfortable for Q4_K_M |
| M3 Ultra (96GB UMA) | 38–46 tok/s | Full model in unified memory |
| M2 Ultra (76GB UMA) | 30–36 tok/s | Full model in unified memory |
| M3 Max (48GB UMA) | 26–32 tok/s | Full model in unified memory |
These are generation speeds (not prefill/prompt processing). Prompt processing is faster; long contexts slow generation slightly.
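To benchmark your own setup the same way, time a full generation and divide token count by elapsed seconds. A small harness sketch (`measure_tps` is an illustrative helper; wire `generate` to whatever streaming client you use):

```python
import time
from typing import Callable, Iterable, Tuple

def measure_tps(generate: Callable[[], Iterable[str]]) -> Tuple[int, float]:
    """Drain a token stream and return (token_count, tokens_per_second).

    `generate` is any callable yielding tokens, e.g. a streaming client
    for Ollama or llama-server (wiring not shown here).
    """
    start = time.perf_counter()
    n = sum(1 for _ in generate())
    elapsed = time.perf_counter() - start
    return n, n / elapsed if elapsed > 0 else 0.0
```

Run it with a 512-token request to match the table's conditions; short outputs overweight prefill time.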
Llama 4 Scout vs Maverick
Meta released two Llama 4 models at launch. Here is when to use each:
| | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|
| Total Parameters | 109B (MoE) | 400B (MoE) |
| Active Parameters | 17B | 17B |
| VRAM (Q4_K_M) | ~6–8 GB | ~20–24 GB |
| Context Window | 10M tokens | 1M tokens |
| Best For | Single GPU, long context tasks | Multi-GPU, highest reasoning quality |
| Coding | Good | Better |
| Reasoning | Good | Best in class |
| Speed | Fast (17B active) | Same active params, larger weight file |
The 10M token context window on Scout is its killer feature — Maverick caps at 1M. If you need to feed entire codebases or document libraries in a single prompt, Scout wins.
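To sanity-check whether a codebase fits in a context window before feeding it in, a crude estimate is one token per ~4 characters, a common rule of thumb for English and code. Real tokenizers vary, so treat this sketch as order-of-magnitude only (both helper names are illustrative):

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough average; real tokenizers vary by language and content

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return len(text) // CHARS_PER_TOKEN

def codebase_tokens(root: str, exts=(".py", ".js", ".md")) -> int:
    """Approximate total tokens of source files under root."""
    return sum(
        estimate_tokens(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in exts
    )
```

If the estimate lands near the model's limit, count properly with the model's actual tokenizer before committing to a single-prompt workflow.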
Troubleshooting
Out of Memory (OOM) Errors
Symptoms: CUDA out of memory, RuntimeError: CUDA error, or llama.cpp crashes at startup.
Fixes:
# 1. Reduce GPU layers (llama.cpp)
# Instead of -ngl 99, try -ngl 40 or -ngl 20
./build/bin/llama-server -m model.gguf -ngl 40

# 2. Use a smaller quantization
# Q8_0 → Q5_K_M → Q4_K_M → Q4_0

# 3. Reduce context window
# Default is often 4096 or 8192 — try 2048
./build/bin/llama-server -m model.gguf -ngl 99 -c 2048

# 4. Close other GPU processes
nvidia-smi  # check what else is using VRAM
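The VRAM check in step 4 can be scripted: nvidia-smi's CSV query mode is machine-readable. A small sketch (helper names are illustrative; NVIDIA GPUs only):

```python
import subprocess

def parse_smi_csv(out: str):
    """Parse 'used, total' CSV lines (MiB values) into int tuples."""
    return [
        tuple(int(x) for x in line.split(","))
        for line in out.strip().splitlines()
    ]

def gpu_memory_mb():
    """Return (used_mib, total_mib) per GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```

Subtract used from total to see how much headroom you actually have before picking a quantization or `-ngl` value.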
Slow Inference (< 5 tok/s)
Cause: The model is running on CPU instead of GPU, or too few layers are offloaded.
Fix:
# Check if CUDA build is active
./build/bin/llama-server --version
# Should show "CUDA" or "Metal" in build info

# Verify GPU is being used
nvidia-smi dmon -s u  # watch GPU utilization while generating

# Force all layers to GPU
./build/bin/llama-server -m model.gguf -ngl 999
If you built llama.cpp without -DGGML_CUDA=ON, it runs on CPU only. Rebuild with the correct flag.
Quantization Quality Loss
If outputs feel degraded (repetitive, incoherent, or wrong factually), you may be over-quantized for your use case. Rule of thumb:
- Creative writing, chat: Q4_K_M is fine
- Coding, logic, math: Q5_K_M or higher
- Production/accuracy-critical: Q8_0 or FP16
Ollama Model Not Found
# If "llama4:scout" is not found, check available tags
ollama list
# Pull with explicit tag
ollama pull llama4:scout-q4_K_M

# Or try the base llama4 tag
ollama pull llama4

Ollama model names can change between releases. Check ollama.com/library/llama4 for current available tags.
llama.cpp Build Errors
# Missing CUDA toolkit
nvcc --version # should return CUDA version
# If missing: install from https://developer.nvidia.com/cuda-downloads

# CMake version too old (need 3.14+)
cmake --version
# Update via: pip install cmake --upgrade

# Metal errors on Apple Silicon
# Ensure Xcode command line tools are installed
xcode-select --install
Next Steps
Once Llama 4 is running locally:
- Check GPU compatibility first: VRAM Calculator →
- Understand quantization tradeoffs: Quantization Guide (Q4/Q5/Q8/FP16)
- Pick the right GPU: Best GPUs for Local LLMs in 2026
- Master Ollama: Complete Ollama Guide 2026
- Compare to cloud AI: Local LLM vs ChatGPT — Real Cost Analysis