Install One Tool. Run Any LLM Locally in 5 Minutes.
Ollama is the fastest way to run large language models on your own hardware. One command installs it. One command downloads a model. One command starts a conversation.
No API keys. No monthly bills. No data sent to anyone's servers. Your conversations stay on your machine.
This guide covers everything: installation on all platforms, pulling and managing models, which models to use for different tasks, advanced configuration with Modelfiles, API mode for building apps, and how to fix the most common errors.
---
1. Installation
Ollama supports macOS, Linux, and Windows. Pick your platform:
macOS
Download the macOS app from ollama.com/download and drag it to your Applications folder. Or install via Homebrew:
brew install ollama
After installing, Ollama runs as a background service automatically. You'll see the llama icon in your menu bar. Apple Silicon (M1/M2/M3/M4): Ollama uses Metal GPU acceleration automatically. M-series chips share CPU and GPU memory, so your full RAM is available for models — an M3 Max with 48GB RAM can run 32B parameter models without breaking a sweat.
Linux
One command installs Ollama and sets it up as a systemd service:
curl -fsSL https://ollama.com/install.sh | sh
Verify the service is running:
systemctl status ollama
NVIDIA GPU on Linux: Install NVIDIA drivers and the CUDA toolkit first. Ollama detects CUDA automatically — no extra configuration needed.
# Check GPU is detected
ollama run llama3.2 --verbose
# Look for "GPU layers: X/Y" in the output
AMD GPU on Linux: ROCm support is built in. Install rocm-hip-sdk and Ollama will use your AMD GPU.
Windows
Download the installer from ollama.com/download. Run it and follow the prompts. Ollama starts automatically and runs in the system tray. Windows GPU support: NVIDIA CUDA works out of the box if your drivers are current. AMD GPUs are supported via ROCm on supported cards. Apple Silicon is macOS-only. Verify installation works on any platform:
ollama --version
ollama version 0.6.x
---
2. Your First Model: The "Hello World" Moment
Pull Llama 3.2 — Meta's fast, capable 3B parameter model:
ollama pull llama3.2
You'll see a progress bar as the model downloads (~2GB for the default Q4 quantization). Once it's done:
ollama run llama3.2
That's it. You're now talking to a local LLM:
>>> Hello! What can you do?
I'm a large language model trained by Meta. I can help with writing, coding,
analysis, answering questions, brainstorming, and much more. What would you
like to work on?
>>> Send a message (/? for help)
Type /bye or press Ctrl+D to exit the conversation.
Run a one-shot prompt (no interactive mode):
ollama run llama3.2 "Explain recursion in one paragraph"
---
3. Model Management
Ollama stores models locally in ~/.ollama/models. Here are the commands you'll use constantly:
Pull a model
ollama pull llama3.2 # Latest Llama 3.2 3B (default)
ollama pull llama3.2:1b # Smallest, fastest version
ollama pull llama4:scout # Llama 4 Scout — best chat model
ollama pull qwen2.5-coder:32b # Best coding model
ollama pull deepseek-r1:14b # Strong reasoning model
List downloaded models
ollama list
NAME ID SIZE MODIFIED
llama3.2:latest a80c4f17acd5 2.0 GB 2 days ago
llama4:scout c8c1253f3f28 9.8 GB 1 day ago
qwen2.5-coder:32b 40b4eb523a7d 19 GB 3 hours ago
Inspect a model
ollama show llama3.2
# Shows model details: parameters, quantization, context length, template
Remove a model
ollama rm llama3.2
# Frees up the disk space
Copy a model (to create a custom version)
ollama cp llama3.2 my-assistant
# Creates a copy you can customize with a Modelfile
Check what's running
ollama ps
# Shows models currently loaded in memory
NAME ID SIZE PROCESSOR UNTIL
llama3.2:latest ... 3.8 GB GPU 4 minutes from now
Ollama unloads a model from memory after 5 minutes of inactivity by default. This frees VRAM for other models.
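Models pile up in ~/.ollama/models faster than you'd expect. As a rough illustration of keeping tabs on that, here's a short Python sketch (a hypothetical helper, not part of the Ollama CLI) that totals the SIZE column from `ollama list` output; the column format is assumed from the example above and may change between versions:

```python
import re

def total_model_size_gb(listing: str) -> float:
    """Sum the SIZE column of `ollama list` output (handles MB and GB entries)."""
    total = 0.0
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(GB|MB)", listing):
        total += float(value) if unit == "GB" else float(value) / 1024
    return round(total, 1)

listing = """NAME                 ID            SIZE    MODIFIED
llama3.2:latest      a80c4f17acd5  2.0 GB  2 days ago
qwen2.5-coder:32b    40b4eb523a7d  19 GB   3 hours ago"""

print(total_model_size_gb(listing))  # 21.0
```

Pipe the real output in with `subprocess.run(["ollama", "list"], capture_output=True)` if you want to script cleanup of your least-used models.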
---
4. Best Models by Use Case
Not every model is good at everything. Here's what to use for each task:
General Chat & Writing
Llama 4 Scout is the top pick for conversational AI in 2026. Meta's multimodal model with 17B active parameters (109B total MoE) handles nuanced conversation, long documents, and creative writing better than anything else in its weight class.
ollama pull llama4:scout
ollama run llama4:scout
Minimum VRAM: ~6GB (Q4) | Check if your GPU can run Llama 4 Scout →
For lighter hardware: llama3.2:3b runs on 4GB VRAM and handles casual conversation well.
Coding
Qwen2.5-Coder 32B is the best local coding model — it matches GPT-4o on multiple coding benchmarks. Use it for code generation, debugging, code review, and refactoring.
ollama pull qwen2.5-coder:32b # Best quality, needs 20GB+ VRAM
ollama pull qwen2.5-coder:14b # Great balance, needs 10GB VRAM
ollama pull qwen2.5-coder:7b # Budget option, needs 5GB VRAM
Check if your GPU can run Qwen2.5-Coder →
DeepSeek-Coder V2 is a strong alternative; the 16B Lite variant uses an MoE architecture (2.4B active parameters) that punches above its VRAM footprint:
ollama pull deepseek-coder-v2:16b # 10GB VRAM
Writing & Long-Form Content
GLM-4 from Tsinghua University excels at structured writing, reports, and long-form generation. It handles Chinese and English equally well:
ollama pull glm4:9b # 6GB VRAM
ollama pull glm4:27b # 16GB VRAM
Mistral models are also excellent for writing — fast, instruction-following, and highly customizable:
ollama pull mistral:7b
Small & Fast (Low VRAM / CPU)
If you're on older hardware, a laptop, or want instant responses:
ollama pull phi4-mini # Microsoft Phi-4 Mini, 3.8B, ~2.5GB VRAM
ollama pull gemma3:4b # Google Gemma 3, strong reasoning, ~3GB VRAM
ollama pull llama3.2:1b # Fastest, bare minimum 1GB VRAM
ollama pull qwen2.5:3b # Good all-rounder, ~2GB VRAM
Reasoning & Math
DeepSeek-R1 uses chain-of-thought reasoning and dramatically outperforms base models on math, logic, and complex analysis:
ollama pull deepseek-r1:7b # 5GB VRAM — great value
ollama pull deepseek-r1:14b # 9GB VRAM — noticeably better
ollama pull deepseek-r1:32b # 20GB VRAM — near frontier quality
---
5. VRAM Requirements
The single biggest factor in what models you can run is your GPU's VRAM. Here's a quick reference:
| VRAM | What You Can Run | Example GPUs |
|---|---|---|
| 4 GB | Up to 3B models (Q4), 1B models comfortably | GTX 1650, RTX 3050 (laptop) |
| 6 GB | Up to 7B models (Q4), 3B models (Q8) | RTX 2060, RTX 3060 (laptop) |
| 8 GB | 7B models (Q8), 13B models (Q4) | RTX 3070, RX 6700 XT, RTX 4060 Ti 8GB |
| 12 GB | 13B–14B models (Q4), 7B models (Q8) | RTX 3060 12GB, RTX 4070 |
| 16 GB | 14B models (Q8), 20B models (Q4) | RTX 4060 Ti 16GB, RX 7900 XT |
| 24 GB | 32B models (Q4), 70B models (Q2, tight) | RTX 3090, RTX 4090, RX 7900 XTX |
| 32 GB | 32B models (Q6), 70B models (Q3, tight) | RTX 5090, M3 Max 36GB |
| 48 GB+ | 70B models (Q4), smaller models at full quality | M3 Max 48GB, M4 Max, dual RTX 3090 |
Our VRAM calculator accounts for quantization level, context length, and model architecture, so it is far more accurate than the simple table above.
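The table rows follow arithmetic you can sanity-check yourself: weights take roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and activations. A back-of-envelope sketch in Python; the 20% overhead factor is an assumption for illustration, not an Ollama constant:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache and activations."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits is ~1 GB
    return round(weight_gb * overhead, 1)

print(estimate_vram_gb(7, 4))   # 4.2, so a 7B Q4 model fits in 6 GB
print(estimate_vram_gb(32, 4))  # 19.2, matching the 20GB+ figure for 32B Q4 models
```

Long contexts inflate the KV-cache share well past 20%, which is exactly why the calculator asks for context length.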
GPU offloading (not enough VRAM?)
If a model is larger than your VRAM, Ollama automatically offloads layers to CPU RAM. It's slower but it works:
# Check how many layers are on GPU vs CPU
ollama run llama3.3:70b --verbose
# "GPU layers: 40/81" means 40 layers on GPU, 41 on CPU
For CPU-only inference, Ollama uses llama.cpp and runs entirely in RAM. Expect roughly 3–8 tokens/second on a modern CPU for a 7B model, compared to 30–80 tokens/second on a decent GPU.
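To quantify how much of a model actually landed on the GPU, you can scrape that "GPU layers: X/Y" line from the verbose output. A small sketch; the exact log wording varies between Ollama versions, so treat the regex pattern as an assumption:

```python
import re

def gpu_layer_fraction(log_line: str):
    """Parse a 'GPU layers: X/Y' line; return the fraction of layers on the GPU."""
    match = re.search(r"GPU layers:\s*(\d+)\s*/\s*(\d+)", log_line)
    if match is None:
        return None  # line didn't contain the expected pattern
    on_gpu, total = int(match.group(1)), int(match.group(2))
    return on_gpu / total

print(round(gpu_layer_fraction("GPU layers: 40/81"), 2))  # 0.49
```

Anything well below 1.0 means a chunk of every token's compute happens on the CPU, which is usually the explanation for disappointing speeds.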
> Want to know which GPU to buy? Read our best GPU for local LLMs guide for 2026 recommendations at every price point.
---
6. Using the Ollama REST API
Ollama runs a local REST API server at http://localhost:11434. This is what makes it useful for building apps — any program can talk to your local LLM the same way it would talk to OpenAI's API.
Generate a completion
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
Chat with message history
curl http://localhost:11434/api/chat -d '
{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."},
{"role": "user", "content": "Give me a simple example"}
]
}'
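When you build on /api/chat, the server is stateless: you keep the conversation history client-side and resend it every turn. A minimal sketch of that bookkeeping (build_chat_payload is a hypothetical helper; the field names match the request body above):

```python
def build_chat_payload(model: str, history: list, user_message: str,
                       stream: bool = False) -> dict:
    """Append the new user turn and build the JSON body for POST /api/chat."""
    messages = history + [{"role": "user", "content": user_message}]
    return {"model": model, "messages": messages, "stream": stream}

history = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
]
payload = build_chat_payload("llama3.2", history, "Give me a simple example")
# POST this dict as JSON to http://localhost:11434/api/chat, then append the
# assistant's reply to `history` before building the next turn's payload.
```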
OpenAI-compatible endpoint
Ollama exposes an OpenAI-compatible API at /v1/chat/completions. Any app built for OpenAI can be pointed at Ollama with zero code changes:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:11434/v1",
apiKey: "ollama", // Required but ignored
});
const response = await client.chat.completions.create({
model: "llama3.2",
messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
Python with the official Ollama library
pip install ollama
import ollama
response = ollama.chat(
model="llama3.2",
messages=[{"role": "user", "content": "What's the capital of France?"}]
)
print(response["message"]["content"])
Stream responses
import ollama
for chunk in ollama.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
):
print(chunk["message"]["content"], end="", flush=True)
---
7. Custom Modelfiles
A Modelfile lets you create a custom version of any model with a persistent system prompt, specific parameters, and even a custom name. This is how you build specialized assistants from base models.
Basic Modelfile
Create a file called Modelfile (no extension):
FROM llama3.2
SYSTEM """
You are a senior software engineer specializing in Python and system design.
You give direct, precise answers with working code examples.
When reviewing code, you identify bugs and suggest improvements.
You never say "Great question!" or add unnecessary filler.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
Build and run it:
ollama create my-engineer -f ./Modelfile
ollama run my-engineer
Key Modelfile parameters
| Parameter | What it does | Typical range |
|---|---|---|
| temperature | Creativity vs consistency. Lower = more deterministic | 0.0–1.0 |
| num_ctx | Context window size (tokens). Larger = more memory | 2048–131072 |
| top_p | Nucleus sampling — filters low-probability tokens | 0.7–1.0 |
| top_k | Limits token selection pool | 20–100 |
| repeat_penalty | Penalizes repetition | 1.0–1.3 |
| num_gpu | GPU layers to use (0 = CPU only) | 0–99 |
System prompt best practices
The system prompt is the most powerful customization. Short, specific prompts beat long generic ones:
# Good — specific role with clear constraints
SYSTEM "You are a terse technical writer. Output markdown only. No preamble."
# Bad — vague and generic
SYSTEM "You are a helpful AI assistant that helps users with various tasks."
Pre-loading conversation history
You can inject example conversations to prime the model's behavior:
FROM llama3.2
SYSTEM "You answer questions about SQL databases concisely."
MESSAGE user "How do I get the 5 most recent rows?"
MESSAGE assistant "SELECT * FROM table ORDER BY id DESC LIMIT 5;"
MESSAGE user "How do I delete duplicates?"
MESSAGE assistant "DELETE FROM t WHERE id NOT IN (SELECT MIN(id) FROM t GROUP BY col);"
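If you maintain several custom assistants, generating Modelfiles from a template keeps them consistent. A small sketch (make_modelfile is a hypothetical convenience function, not part of Ollama's tooling; write its output to a file and pass it to ollama create -f as shown above):

```python
def make_modelfile(base: str, system: str, **params) -> str:
    """Render a minimal Modelfile: FROM, SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines)

text = make_modelfile(
    "llama3.2",
    "You answer questions about SQL databases concisely.",
    temperature=0.3,
    num_ctx=8192,
)
print(text)
```

Write the result to ./Modelfile, then `ollama create my-sql-helper -f ./Modelfile` builds the assistant.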
---
8. Advanced Configuration
Increase context window
By default, most models use a 2048–4096 token context. For longer documents:
# Set context to 32K for the current interactive session
ollama run llama3.2
>>> /set parameter num_ctx 32768
# Or set permanently in a Modelfile
PARAMETER num_ctx 32768
Caution: Larger context uses more VRAM. Double-check with the VRAM calculator before increasing significantly.
Run Ollama on a specific IP/port
By default, Ollama only listens on 127.0.0.1:11434. To share it on your local network:
# macOS / Linux
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Or set permanently via environment variable
export OLLAMA_HOST=0.0.0.0:11434
Security note: Never expose port 11434 to the internet without authentication. Anyone with access can run queries and consume your GPU.
Keep models loaded in memory
Ollama unloads models after 5 minutes by default. Override this:
# Keep loaded indefinitely
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'
# Unload immediately after use
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'
# Or set globally
export OLLAMA_KEEP_ALIVE=10m
Environment variables reference
| Variable | Default | What it does |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | IP and port to listen on |
| OLLAMA_MODELS | ~/.ollama/models | Where models are stored |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep models loaded |
| OLLAMA_NUM_PARALLEL | 1 | Parallel requests (needs extra VRAM) |
| OLLAMA_MAX_LOADED_MODELS | 1 | Max models in memory simultaneously |
| OLLAMA_FLASH_ATTENTION | 0 | Enable Flash Attention (reduces VRAM) |
Enable Flash Attention
Flash Attention reduces memory usage with minimal quality impact — recommended for most setups:
export OLLAMA_FLASH_ATTENTION=1
ollama serve
---
9. Troubleshooting Common Errors
"Error: model not found"
# Check available models
ollama list
# Pull the missing model
ollama pull llama3.2
# Check exact model name on ollama.com/library
ollama pull llama3.2:latest
Also check spelling — llama3.2 is correct, not llama-3.2 or llama_3_2.
VRAM out of memory (OOM)
Error: GPU out of memory. Requested X bytes but only Y bytes available
Solutions:
- Use a smaller quantization: ollama pull llama3.3:70b-instruct-q2_K instead of the default Q4
- Use a smaller model: llama3.2:1b instead of llama3.2:3b
- Check what else is using VRAM: nvidia-smi on Linux/Windows, Activity Monitor on Mac
# Check VRAM usage on NVIDIA
nvidia-smi
# Check on Mac
sudo powermetrics --samplers gpu_power -n 1
Use the VRAM calculator to find models that fit your GPU →
Slow inference (low tokens/second)
Expected speeds for reference:
| Setup | Tokens/second (7B Q4) |
|---|---|
| CPU only (modern) | 3–8 |
| RTX 3060 12GB | 25–40 |
| RTX 4070 | 50–70 |
| RTX 4090 | 100–130 |
| M3 Max | 60–90 |
# Check if GPU is being used (look for GPU layers count)
ollama run llama3.2 --verbose
# Check if model is in VRAM or RAM
ollama ps
Common causes of unexpectedly slow inference:
- Model not using GPU: Drivers not installed, CUDA not found → reinstall CUDA
- Model in CPU RAM: Not enough VRAM for full GPU load → use smaller model or enable Flash Attention
- Wrong model format: Some models are fp16 by default, much larger than Q4
Ollama not starting
# Linux: check service status
systemctl status ollama
journalctl -u ollama -n 50
# macOS: start manually
ollama serve
# Check if port is in use
lsof -i :11434
Connection refused (API not responding)
# Start the Ollama server manually
ollama serve
# Test if it's running
curl http://localhost:11434
# Should return: "Ollama is running"
If ollama serve throws an error, Ollama is probably already running in the background (system tray on macOS/Windows). Kill and restart it.
Model download stuck / corrupted
# Force re-download
ollama rm llama3.2
ollama pull llama3.2
# Check disk space — large models need 4–50GB free
df -h ~/.ollama
---
What to Try Next
You're set up and running. Here's where to go from here:
- Find the right model for your hardware: VRAM Calculator — enter your GPU, see exactly which models fit
- Picking a GPU: Best GPU for Local LLMs — 2026 buying guide at every price point
- Coding assistant: Best LLMs for Coding — set up a local GitHub Copilot alternative
- Understand quantization: LLM Quantization Explained — why Q4 vs Q8 matters for your hardware
- Compare tools: Ollama vs LM Studio vs Jan — when to use which