Install One Tool. Run Any LLM Locally in 5 Minutes.
Ollama is the fastest way to run large language models on your own hardware. One command installs it. One command downloads a model. One command starts a conversation.
No API keys. No monthly bills. No data sent to anyone's servers. Your conversations stay on your machine.
This guide covers everything: installation on all platforms, pulling and managing models, which models to use for different tasks, advanced configuration with Modelfiles, API mode for building apps, and how to fix the most common errors.
---
1. Installation
Ollama supports macOS, Linux, and Windows. Pick your platform:
macOS
Download the macOS app from ollama.com/download and drag it to your Applications folder. Or install via Homebrew:
brew install ollama
After installing, Ollama runs as a background service automatically. You'll see the llama icon in your menu bar. Apple Silicon (M1/M2/M3/M4): Ollama uses Metal GPU acceleration automatically. M-series chips share CPU and GPU memory, so your full RAM is available for models — an M3 Max with 48GB RAM can run 32B parameter models without breaking a sweat.
Linux
One command installs Ollama and sets it up as a systemd service:
curl -fsSL https://ollama.com/install.sh | sh
Verify the service is running:
systemctl status ollama
NVIDIA GPU on Linux: Install NVIDIA drivers and the CUDA toolkit first. Ollama detects CUDA automatically — no extra configuration needed.
# Check GPU is detected
ollama run llama3.2 --verbose
# Look for "GPU layers: X/Y" in the output
AMD GPU on Linux: ROCm support is built in. Install rocm-hip-sdk and Ollama will use your AMD GPU.
Windows
Download the installer from ollama.com/download. Run it and follow the prompts. Ollama starts automatically and runs in the system tray. Windows GPU support: NVIDIA CUDA works out of the box if your drivers are current. AMD GPUs are supported via ROCm on supported cards. Apple Silicon is macOS-only. Verify installation works on any platform:
ollama --version
ollama version 0.6.x
---
2. Your First Model: The "Hello World" Moment
Pull Llama 3.2 — Meta's fast, capable 3B parameter model:
ollama pull llama3.2
You'll see a progress bar as the model downloads (~2GB for the default Q4 quantization). Once it's done:
ollama run llama3.2
That's it. You're now talking to a local LLM:
>>> Hello! What can you do?
I'm a large language model trained by Meta. I can help with writing, coding,
analysis, answering questions, brainstorming, and much more. What would you
like to work on?
>>> Send a message (/? for help)
Type /bye or press Ctrl+D to exit the conversation.
Run a one-shot prompt (no interactive mode):
ollama run llama3.2 "Explain recursion in one paragraph"
---
3. Model Management
Ollama stores models locally in ~/.ollama/models. Here are the commands you'll use constantly:
Pull a model
ollama pull llama3.2 # Latest Llama 3.2 3B (default)
ollama pull llama3.2:1b # Smallest, fastest version
ollama pull llama4:scout # Llama 4 Scout — best chat model
ollama pull qwen2.5-coder:32b # Best coding model
ollama pull deepseek-r1:14b # Strong reasoning model
List downloaded models
ollama list
NAME ID SIZE MODIFIED
llama3.2:latest a80c4f17acd5 2.0 GB 2 days ago
llama4:scout c8c1253f3f28 9.8 GB 1 day ago
qwen2.5-coder:32b 40b4eb523a7d 19 GB 3 hours ago
Inspect a model
ollama show llama3.2
# Shows model details: parameters, quantization, context length, template
Remove a model
ollama rm llama3.2
# Frees up the disk space
Copy a model (to create a custom version)
ollama cp llama3.2 my-assistant
# Creates a copy you can customize with a Modelfile
Check what's running
ollama ps
# Shows models currently loaded in memory
NAME ID SIZE PROCESSOR UNTIL
llama3.2:latest ... 3.8 GB GPU 4 minutes from now
Ollama unloads a model from memory after 5 minutes of inactivity by default. This frees VRAM for other models.
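Models pile up in ~/.ollama/models faster than you'd expect. As a rough illustration of keeping tabs on that, here's a short Python sketch (a hypothetical helper, not part of the Ollama CLI) that totals the SIZE column from `ollama list` output; the column format is assumed from the example above and may change between versions:

```python
import re

def total_model_size_gb(listing: str) -> float:
    """Sum the SIZE column of `ollama list` output (handles MB and GB entries)."""
    total = 0.0
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(GB|MB)", listing):
        total += float(value) if unit == "GB" else float(value) / 1024
    return round(total, 1)

listing = """NAME                 ID            SIZE    MODIFIED
llama3.2:latest      a80c4f17acd5  2.0 GB  2 days ago
qwen2.5-coder:32b    40b4eb523a7d  19 GB   3 hours ago"""

print(total_model_size_gb(listing))  # 21.0
```

Pipe the real output in with `subprocess.run(["ollama", "list"], capture_output=True)` if you want to script cleanup of your least-used models.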
---
4. Best Models by Use Case
Not every model is good at everything. Here's what to use for each task:
General Chat & Writing
Llama 4 Scout is the top pick for conversational AI in 2026. Meta's multimodal model with 17B active parameters (109B total MoE) handles nuanced conversation, long documents, and creative writing better than anything else in its weight class.
ollama pull llama4:scout
ollama run llama4:scout
Minimum VRAM: ~6GB (Q4) | Check if your GPU can run Llama 4 Scout →
For lighter hardware: llama3.2:3b runs on 4GB VRAM and handles casual conversation well.
Coding
Qwen2.5-Coder 32B is the best local coding model — it matches GPT-4o on multiple coding benchmarks. Use it for code generation, debugging, code review, and refactoring.
ollama pull qwen2.5-coder:32b # Best quality, needs 20GB+ VRAM
ollama pull qwen2.5-coder:14b # Great balance, needs 10GB VRAM
ollama pull qwen2.5-coder:7b # Budget option, needs 5GB VRAM
Check if your GPU can run Qwen2.5-Coder →
DeepSeek-Coder V2 is a strong alternative; the 16B Lite variant uses an MoE architecture (2.4B active parameters) that punches above its VRAM footprint:
ollama pull deepseek-coder-v2:16b # 10GB VRAM
Writing & Long-Form Content
GLM-4 from Tsinghua University excels at structured writing, reports, and long-form generation. It handles Chinese and English equally well:
ollama pull glm4:9b # 6GB VRAM
ollama pull glm4:27b # 16GB VRAM
Mistral models are also excellent for writing — fast, instruction-following, and highly customizable:
ollama pull mistral:7b
Small & Fast (Low VRAM / CPU)
If you're on older hardware, a laptop, or want instant responses:
ollama pull phi4-mini # Microsoft Phi-4 Mini, 3.8B, ~2.5GB VRAM
ollama pull gemma3:4b # Google Gemma 3, strong reasoning, ~3GB VRAM
ollama pull llama3.2:1b # Fastest, bare minimum 1GB VRAM
ollama pull qwen2.5:3b # Good all-rounder, ~2GB VRAM
Reasoning & Math
DeepSeek-R1 uses chain-of-thought reasoning and dramatically outperforms base models on math, logic, and complex analysis:
ollama pull deepseek-r1:7b # 5GB VRAM — great value
ollama pull deepseek-r1:14b # 9GB VRAM — noticeably better
ollama pull deepseek-r1:32b # 20GB VRAM — near frontier quality
---
5. VRAM Requirements
The single biggest factor in what models you can run is your GPU's VRAM. Here's a quick reference:
| VRAM | What You Can Run | Example GPUs |
|---|---|---|
| 4 GB | Up to 3B models (Q4), 1B models comfortably | GTX 1650, RTX 3050 (laptop) |
| 6 GB | Up to 7B models (Q4), 3B models (Q8) | RTX 2060, RTX 3060 (laptop) |
| 8 GB | 7B models (Q8), 13B models (Q4) | RTX 3070, RX 6700 XT, RTX 4060 Ti 8GB |
| 12 GB | 13B–14B models (Q4), 7B models (Q8) | RTX 3060 12GB, RTX 4070 |
| 16 GB | 14B models (Q8), 20B models (Q4) | RTX 4060 Ti 16GB, RX 7900 XT |
| 24 GB | 32B models (Q4), 70B models (Q2, tight) | RTX 3090, RTX 4090, RX 7900 XTX |
| 32 GB | 32B models (Q6), 70B models (Q3, tight) | RTX 5090, M3 Max 36GB |
| 48 GB+ | 70B models (Q4), smaller models at full quality | M3 Max 48GB, M4 Max, dual RTX 3090 |
Our VRAM calculator accounts for quantization level, context length, and model architecture, so it is far more accurate than the simple table above.
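The table rows follow arithmetic you can sanity-check yourself: weights take roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and activations. A back-of-envelope sketch in Python; the 20% overhead factor is an assumption for illustration, not an Ollama constant:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache and activations."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits is ~1 GB
    return round(weight_gb * overhead, 1)

print(estimate_vram_gb(7, 4))   # 4.2, so a 7B Q4 model fits in 6 GB
print(estimate_vram_gb(32, 4))  # 19.2, matching the 20GB+ figure for 32B Q4 models
```

Long contexts inflate the KV-cache share well past 20%, which is exactly why the calculator asks for context length.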
GPU offloading (not enough VRAM?)
If a model is larger than your VRAM, Ollama automatically offloads layers to CPU RAM. It's slower but it works:
# Check how many layers are on GPU vs CPU
ollama run llama3.3:70b --verbose
# "GPU layers: 40/81" means 40 layers on GPU, 41 on CPU
For CPU-only inference, Ollama uses llama.cpp and runs entirely in RAM. Expect roughly 3–8 tokens/second on a modern CPU for a 7B model, compared to 30–80 tokens/second on a decent GPU.
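To quantify how much of a model actually landed on the GPU, you can scrape that "GPU layers: X/Y" line from the verbose output. A small sketch; the exact log wording varies between Ollama versions, so treat the regex pattern as an assumption:

```python
import re

def gpu_layer_fraction(log_line: str):
    """Parse a 'GPU layers: X/Y' line; return the fraction of layers on the GPU."""
    match = re.search(r"GPU layers:\s*(\d+)\s*/\s*(\d+)", log_line)
    if match is None:
        return None  # line didn't contain the expected pattern
    on_gpu, total = int(match.group(1)), int(match.group(2))
    return on_gpu / total

print(round(gpu_layer_fraction("GPU layers: 40/81"), 2))  # 0.49
```

Anything well below 1.0 means a chunk of every token's compute happens on the CPU, which is usually the explanation for disappointing speeds.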
> Want to know which GPU to buy? Read our best GPU for local LLMs guide for 2026 recommendations at every price point.
---
6. Using the Ollama REST API
Ollama runs a local REST API server at http://localhost:11434. This is what makes it useful for building apps — any program can talk to your local LLM the same way it would talk to OpenAI's API.
Generate a completion
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
Chat with message history
curl http://localhost:11434/api/chat -d '
{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."},
{"role": "user", "content": "Give me a simple example"}
]
}'
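When you build on /api/chat, the server is stateless: you keep the conversation history client-side and resend it every turn. A minimal sketch of that bookkeeping (build_chat_payload is a hypothetical helper; the field names match the request body above):

```python
def build_chat_payload(model: str, history: list, user_message: str,
                       stream: bool = False) -> dict:
    """Append the new user turn and build the JSON body for POST /api/chat."""
    messages = history + [{"role": "user", "content": user_message}]
    return {"model": model, "messages": messages, "stream": stream}

history = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
]
payload = build_chat_payload("llama3.2", history, "Give me a simple example")
# POST this dict as JSON to http://localhost:11434/api/chat, then append the
# assistant's reply to `history` before building the next turn's payload.
```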
OpenAI-compatible endpoint
Ollama exposes an OpenAI-compatible API at /v1/chat/completions. Any app built for OpenAI can be pointed at Ollama with zero code changes:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:11434/v1",
apiKey: "ollama", // Required but ignored
});
const response = await client.chat.completions.create({
model: "llama3.2",
messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
Python with the official Ollama library
pip install ollama
import ollama
response = ollama.chat(
model="llama3.2",
messages=[{"role": "user", "content": "What's the capital of France?"}]
)
print(response["message"]["content"])
Stream responses
import ollama
for chunk in ollama.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
):
print(chunk["message"]["content"], end="", flush=True)
---
7. Custom Modelfiles
A Modelfile lets you create a custom version of any model with a persistent system prompt, specific parameters, and even a custom name. This is how you build specialized assistants from base models.
Basic Modelfile
Create a file called Modelfile (no extension):
FROM llama3.2
SYSTEM """
You are a senior software engineer specializing in Python and system design.
You give direct, precise answers with working code examples.
When reviewing code, you identify bugs and suggest improvements.
You never say "Great question!" or add unnecessary filler.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
Build and run it:
ollama create my-engineer -f ./Modelfile
ollama run my-engineer
Key Modelfile parameters
| Parameter | What it does | Typical range |
|---|---|---|
| temperature | Creativity vs consistency. Lower = more deterministic | 0.0–1.0 |
| num_ctx | Context window size (tokens). Larger = more memory | 2048–131072 |
| top_p | Nucleus sampling — filters low-probability tokens | 0.7–1.0 |
| top_k | Limits token selection pool | 20–100 |
| repeat_penalty | Penalizes repetition | 1.0–1.3 |
| num_gpu | GPU layers to use (0 = CPU only) | 0–99 |
System prompt best practices
The system prompt is the most powerful customization. Short, specific prompts beat long generic ones:
# Good — specific role with clear constraints
SYSTEM "You are a terse technical writer. Output markdown only. No preamble."
# Bad — vague and generic
SYSTEM "You are a helpful AI assistant that helps users with various tasks."
Pre-loading conversation history
You can inject example conversations to prime the model's behavior:
FROM llama3.2
SYSTEM "You answer questions about SQL databases concisely."
MESSAGE user "How do I get the 5 most recent rows?"
MESSAGE assistant "SELECT * FROM table ORDER BY id DESC LIMIT 5;"
MESSAGE user "How do I delete duplicates?"
MESSAGE assistant "DELETE FROM t WHERE id NOT IN (SELECT MIN(id) FROM t GROUP BY col);"
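If you maintain several custom assistants, generating Modelfiles from a template keeps them consistent. A small sketch (make_modelfile is a hypothetical convenience function, not part of Ollama's tooling; write its output to a file and pass it to ollama create -f as shown above):

```python
def make_modelfile(base: str, system: str, **params) -> str:
    """Render a minimal Modelfile: FROM, SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines)

text = make_modelfile(
    "llama3.2",
    "You answer questions about SQL databases concisely.",
    temperature=0.3,
    num_ctx=8192,
)
print(text)
```

Write the result to ./Modelfile, then `ollama create my-sql-helper -f ./Modelfile` builds the assistant.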
---
8. Advanced Configuration
Increase context window
By default, most models use a 2048–4096 token context. For longer documents:
# Set context to 32K for the current interactive session
ollama run llama3.2
>>> /set parameter num_ctx 32768
# Or set permanently in a Modelfile
PARAMETER num_ctx 32768
Caution: Larger context uses more VRAM. Double-check with the VRAM calculator before increasing significantly.
Run Ollama on a specific IP/port
By default, Ollama only listens on 127.0.0.1:11434. To share it on your local network:
# macOS / Linux
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Or set permanently via environment variable
export OLLAMA_HOST=0.0.0.0:11434
Security note: Never expose port 11434 to the internet without authentication. Anyone with access can run queries and consume your GPU.
Keep models loaded in memory
Ollama unloads models after 5 minutes by default. Override this:
# Keep loaded indefinitely
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'
# Unload immediately after use
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'
# Or set globally
export OLLAMA_KEEP_ALIVE=10m
Environment variables reference
| Variable | Default | What it does |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | IP and port to listen on |
| OLLAMA_MODELS | ~/.ollama/models | Where models are stored |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep models loaded |
| OLLAMA_NUM_PARALLEL | 1 | Parallel requests (needs extra VRAM) |
| OLLAMA_MAX_LOADED_MODELS | 1 | Max models in memory simultaneously |
| OLLAMA_FLASH_ATTENTION | 0 | Enable Flash Attention (reduces VRAM) |
Enable Flash Attention
Flash Attention reduces memory usage with minimal quality impact — recommended for most setups:
export OLLAMA_FLASH_ATTENTION=1
ollama serve
---
9. Troubleshooting Common Errors
"Error: model not found"
# Check available models
ollama list
# Pull the missing model
ollama pull llama3.2
# Check exact model name on ollama.com/library
ollama pull llama3.2:latest
Also check spelling — llama3.2 is correct, not llama-3.2 or llama_3_2.
VRAM out of memory (OOM)
Error: GPU out of memory. Requested X bytes but only Y bytes available
Solutions:
- Use a smaller quantization: ollama pull llama3.3:70b-instruct-q2_K instead of the default Q4
- Use a smaller model: llama3.2:1b instead of llama3.2:3b
- Check what else is using VRAM: nvidia-smi on Linux/Windows, Activity Monitor on Mac
# Check VRAM usage on NVIDIA
nvidia-smi
# Check on Mac
sudo powermetrics --samplers gpu_power -n 1
Use the VRAM calculator to find models that fit your GPU →
Slow inference (low tokens/second)
Expected speeds for reference:
| Setup | Tokens/second (7B Q4) |
|---|---|
| CPU only (modern) | 3–8 |
| RTX 3060 12GB | 25–40 |
| RTX 4070 | 50–70 |
| RTX 4090 | 100–130 |
| M3 Max | 60–90 |
# Check if GPU is being used (look for GPU layers count)
ollama run llama3.2 --verbose
# Check if model is in VRAM or RAM
ollama ps
Common causes of unexpectedly slow inference:
- Model not using GPU: Drivers not installed, CUDA not found → reinstall CUDA
- Model in CPU RAM: Not enough VRAM for full GPU load → use smaller model or enable Flash Attention
- Wrong model format: Some models are fp16 by default, much larger than Q4
Ollama not starting
# Linux: check service status
systemctl status ollama
journalctl -u ollama -n 50
# macOS: start manually
ollama serve
# Check if port is in use
lsof -i :11434
Connection refused (API not responding)
# Start the Ollama server manually
ollama serve
# Test if it's running
curl http://localhost:11434
# Should return: "Ollama is running"
If ollama serve throws an error, Ollama is probably already running in the background (system tray on macOS/Windows). Kill and restart it.
Model download stuck / corrupted
# Force re-download
ollama rm llama3.2
ollama pull llama3.2
# Check disk space — large models need 4–50GB free
df -h ~/.ollama
---
What to Try Next
You're set up and running. Here's where to go from here:
- Find the right model for your hardware: VRAM Calculator — enter your GPU, see exactly which models fit
- Picking a GPU: Best GPU for Local LLMs — 2026 buying guide at every price point
- Coding assistant: Best LLMs for Coding — set up a local GitHub Copilot alternative
- Understand quantization: LLM Quantization Explained — why Q4 vs Q8 matters for your hardware
- Compare tools: Ollama vs LM Studio vs Jan — when to use which