Complete Guide to Running LLMs with Ollama (2026)

Install one tool, run any LLM locally in 5 minutes. This is the only Ollama tutorial you need — from installation through advanced Modelfiles, API mode, and troubleshooting.

Install One Tool. Run Any LLM Locally in 5 Minutes.

Ollama is the fastest way to run large language models on your own hardware. One command installs it. One command downloads a model. One command starts a conversation.

No API keys. No monthly bills. No data sent to anyone's servers. Your conversations stay on your machine.

This guide covers everything: installation on all platforms, pulling and managing models, which models to use for different tasks, advanced configuration with Modelfiles, API mode for building apps, and how to fix the most common errors.

---

1. Installation

Ollama supports macOS, Linux, and Windows. Pick your platform:

macOS

Download the macOS app from ollama.com/download and drag it to your Applications folder. Or install via Homebrew:

brew install ollama

After installing, Ollama runs as a background service automatically. You'll see the llama icon in your menu bar.

Apple Silicon (M1/M2/M3/M4): Ollama uses Metal GPU acceleration automatically. M-series chips share CPU and GPU memory, so your full RAM is available for models — an M3 Max with 48GB RAM can run 32B parameter models without breaking a sweat.

Linux

One command installs Ollama and sets it up as a systemd service:

curl -fsSL https://ollama.com/install.sh | sh

Verify the service is running:

systemctl status ollama

NVIDIA GPU on Linux: Install NVIDIA drivers and the CUDA toolkit first. Ollama detects CUDA automatically — no extra configuration needed.

# Check GPU is detected
ollama run llama3.2 --verbose

# Look for "GPU layers: X/Y" in the output

AMD GPU on Linux: ROCm support is built in. Install rocm-hip-sdk and Ollama will use your AMD GPU.

Windows

Download the installer from ollama.com/download. Run it and follow the prompts. Ollama starts automatically and runs in the system tray.

Windows GPU support: NVIDIA CUDA works out of the box if your drivers are current. AMD GPUs are supported via ROCm. Apple Silicon is macOS-only.

Verify installation works on any platform:

ollama --version

ollama version 0.6.x

---

2. Your First Model: The "Hello World" Moment

Pull Llama 3.2 — Meta's fast, capable 3B parameter model:

ollama pull llama3.2

You'll see a progress bar as the model downloads (~2GB for the default Q4 quantization). Once it's done:

ollama run llama3.2

That's it. You're now talking to a local LLM:

>>> Hello! What can you do?
I'm a large language model trained by Meta. I can help with writing, coding,
analysis, answering questions, brainstorming, and much more. What would you
like to work on?

>>> Send a message (/? for help)

Type /bye or press Ctrl+D to exit the conversation.

Run a one-shot prompt (no interactive mode):

ollama run llama3.2 "Explain recursion in one paragraph"

---

3. Model Management

Ollama stores models locally in ~/.ollama/models. Here are the commands you'll use constantly:

Pull a model

ollama pull llama3.2          # Latest Llama 3.2 3B (default)
ollama pull llama3.2:1b       # Smallest, fastest version
ollama pull llama4:scout      # Llama 4 Scout — best chat model
ollama pull qwen2.5-coder:32b # Best coding model
ollama pull deepseek-r1:14b   # Strong reasoning model

List downloaded models

ollama list

NAME                 ID              SIZE      MODIFIED
llama3.2:latest      a80c4f17acd5    2.0 GB    2 days ago
llama4:scout         c8c1253f3f28    9.8 GB    1 day ago
qwen2.5-coder:32b    40b4eb523a7d    19 GB     3 hours ago
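
If you're scripting around Ollama, note that the /api/tags endpoint returns this same list as JSON, which is the robust option. When all you have is the text output, a quick parser like this sketch works, assuming the default NAME / ID / SIZE / MODIFIED column layout:

```python
def parse_ollama_list(output: str) -> list[dict]:
    """Parse the tabular text from `ollama list` into dicts.
    Assumes the default NAME / ID / SIZE / MODIFIED columns."""
    lines = [l for l in output.strip().splitlines() if l.strip()]
    models = []
    for line in lines[1:]:  # first line is the header row
        parts = line.split()
        models.append({
            "name": parts[0],
            "id": parts[1],
            "size": f"{parts[2]} {parts[3]}",   # e.g. "2.0 GB"
            "modified": " ".join(parts[4:]),    # e.g. "2 days ago"
        })
    return models

sample = """NAME              ID            SIZE    MODIFIED
llama3.2:latest   a80c4f17acd5  2.0 GB  2 days ago"""
print(parse_ollama_list(sample)[0]["name"])
```

Text-format parsing is brittle across Ollama versions, so prefer the JSON endpoint when you can.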

Inspect a model

ollama show llama3.2

Shows model details: parameters, quantization, context length, template

Remove a model

ollama rm llama3.2

Frees up the disk space

Copy a model (to create a custom version)

ollama cp llama3.2 my-assistant

Creates a copy you can customize with a Modelfile

Check what's running

ollama ps

Shows models currently loaded in memory

NAME               ID      SIZE      PROCESSOR    UNTIL
llama3.2:latest    ...     3.8 GB    GPU          4 minutes from now

Ollama unloads a model from memory after 5 minutes of inactivity by default. This frees VRAM for other models.

---

4. Best Models by Use Case

Not every model is good at everything. Here's what to use for each task:

General Chat & Writing

Llama 4 Scout is the top pick for conversational AI in 2026. Meta's multimodal model with 17B active parameters (109B total MoE) handles nuanced conversation, long documents, and creative writing better than anything else in its weight class.
ollama pull llama4:scout
ollama run llama4:scout
Minimum VRAM: ~6GB (Q4) | Check if your GPU can run Llama 4 Scout →

For lighter hardware: llama3.2:3b runs on 4GB VRAM and handles casual conversation well.

Coding

Qwen2.5-Coder 32B is the best local coding model — it matches GPT-4o on multiple coding benchmarks. Use it for code generation, debugging, code review, and refactoring.
ollama pull qwen2.5-coder:32b  # Best quality, needs 20GB+ VRAM
ollama pull qwen2.5-coder:14b  # Great balance, needs 10GB VRAM
ollama pull qwen2.5-coder:7b   # Budget option, needs 5GB VRAM
Check if your GPU can run Qwen2.5-Coder →

DeepSeek-Coder V2 is a strong alternative. Its Lite variant uses an MoE architecture (16B total parameters, 2.4B active) that punches well above its VRAM footprint:
ollama pull deepseek-coder-v2:16b  # 10GB VRAM

Writing & Long-Form Content

GLM-4 from Tsinghua University excels at structured writing, reports, and long-form generation. It handles Chinese and English equally well:
ollama pull glm4:9b   # 6GB VRAM
ollama pull glm4:27b  # 16GB VRAM
Mistral models are also excellent for writing — fast, instruction-following, and highly customizable:
ollama pull mistral:7b

Small & Fast (Low VRAM / CPU)

If you're on older hardware, a laptop, or want instant responses:

ollama pull phi4:mini         # Microsoft Phi-4 Mini, 3.8B, ~2.5GB VRAM
ollama pull gemma3:4b         # Google Gemma 3, strong reasoning, ~3GB VRAM
ollama pull llama3.2:1b       # Fastest, bare minimum 1GB VRAM
ollama pull qwen2.5:3b        # Good all-rounder, ~2GB VRAM

Reasoning & Math

DeepSeek-R1 uses chain-of-thought reasoning and dramatically outperforms base models on math, logic, and complex analysis:
ollama pull deepseek-r1:7b    # 5GB VRAM — great value
ollama pull deepseek-r1:14b   # 9GB VRAM — noticeably better
ollama pull deepseek-r1:32b   # 20GB VRAM — near frontier quality

---

5. VRAM Requirements

The single biggest factor in what models you can run is your GPU's VRAM. Here's a quick reference:

| VRAM | What You Can Run | Example GPUs |
|---|---|---|
| 4 GB | Up to 3B models (Q4), 1B models comfortably | GTX 1650 Super, RTX 3050 (laptop) |
| 6 GB | Up to 7B models (Q4), 3B models (Q8) | RTX 2060, RTX 3060 (laptop) |
| 8 GB | 7B models (Q8), 13B models (Q4) | RTX 3070, RX 6600 XT, RTX 4060 |
| 12 GB | 13B models (Q8), 20B models (Q4) | RTX 3060 12GB, RTX 4070 |
| 16 GB | 20B models (Q8), 32B models (Q4) | RTX 4060 Ti 16GB, RX 7800 XT |
| 24 GB | 32B models (Q8), 70B models (Q4) | RTX 3090, RTX 4090, RX 7900 XTX |
| 32 GB | 70B models (Q8) | RTX 5090, M3 Max 36GB |
| 48 GB+ | Any model, highest quality | M3 Max 48GB, M4 Max, dual RTX 3090 |
→ Use the VRAM Calculator to check your exact GPU against any model

The calculator accounts for quantization level, context length, and model architecture — far more accurate than the simple table above.
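
As a back-of-the-envelope version of that calculation: weight memory is roughly parameter count times bits-per-weight, plus runtime overhead for the KV cache and buffers. The 20% overhead factor below is a rough assumption, not a measured constant — treat the result as a ballpark, not a guarantee:

```python
def estimate_vram_gb(params_billion: float, quant_bits: int = 4,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% overhead for
    KV cache and runtime buffers (an assumption, not exact)."""
    weight_gb = params_billion * quant_bits / 8
    return round(weight_gb * overhead, 1)

print(estimate_vram_gb(7))        # 7B at Q4
print(estimate_vram_gb(32, 4))    # 32B at Q4
```

Real usage also scales with context length, which this sketch ignores — hence the calculator.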

GPU offloading (not enough VRAM?)

If a model is larger than your VRAM, Ollama automatically offloads layers to CPU RAM. It's slower but it works:

# Check how many layers are on GPU vs CPU
ollama run llama3.3:70b --verbose

"GPU layers: 40/81" means 40 layers on GPU, 41 on CPU

For CPU-only inference, Ollama uses llama.cpp and runs entirely in RAM. Expect 1–5 tokens/second on modern CPUs for 7B models, compared to 30–80 tokens/second on a decent GPU.
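
You can measure your own throughput without a stopwatch: with "stream": false, the /api/generate response includes eval_count (tokens generated) and eval_duration (in nanoseconds). A tiny helper converts those into tokens/second:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput from the stats Ollama returns with each response:
    eval_count tokens generated over eval_duration nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 256 tokens generated in 4.0 seconds of eval time
print(round(tokens_per_second(256, 4_000_000_000), 1))
```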

> Want to know which GPU to buy? Read our best GPU for local LLMs guide for 2026 recommendations at every price point.

---

6. Using the Ollama REST API

Ollama runs a local REST API server at http://localhost:11434. This is what makes it useful for building apps — any program can talk to your local LLM the same way it would talk to OpenAI's API.

Generate a completion

curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'

Chat with message history

curl http://localhost:11434/api/chat -d '
{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Give me a simple example"}
  ]
}'
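
Because you send the full message list on every call, long conversations eventually overflow the model's context window. A simple (and admittedly crude) approach is trimming old turns by a character budget; characters here are just a cheap stand-in for real token counting, and the helper name is my own:

```python
def trim_history(messages: list[dict], max_chars: int = 8000) -> list[dict]:
    """Keep the most recent messages whose combined length fits the
    budget. Characters approximate tokens here (a rough assumption);
    a real tokenizer would be more accurate."""
    kept, total = [], 0
    for msg in reversed(messages):      # walk newest to oldest
        total += len(msg["content"])
        if total > max_chars:
            break
        kept.append(msg)
    return list(reversed(kept))          # restore chronological order

history = [
    {"role": "user", "content": "x" * 6000},
    {"role": "assistant", "content": "y" * 3000},
    {"role": "user", "content": "latest question"},
]
print(len(trim_history(history)))  # oldest turn dropped
```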

OpenAI-compatible endpoint

Ollama exposes an OpenAI-compatible API at /v1/chat/completions. Any app built for OpenAI can be pointed at Ollama with zero code changes:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // Required but ignored
});

const response = await client.chat.completions.create({
  model: "llama3.2",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(response.choices[0].message.content);

Python with the official Ollama library

pip install ollama

import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)

print(response["message"]["content"])

Stream responses

import ollama

for chunk in ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)

---

7. Custom Modelfiles

A Modelfile lets you create a custom version of any model with a persistent system prompt, specific parameters, and even a custom name. This is how you build specialized assistants from base models.

Basic Modelfile

Create a file called Modelfile (no extension):

FROM llama3.2

SYSTEM """
You are a senior software engineer specializing in Python and system design.
You give direct, precise answers with working code examples.
When reviewing code, you identify bugs and suggest improvements.
You never say "Great question!" or add unnecessary filler.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

Build and run it:

ollama create my-engineer -f ./Modelfile
ollama run my-engineer

Key Modelfile parameters

| Parameter | What it does | Typical range |
|---|---|---|
| temperature | Creativity vs consistency. Lower = more deterministic | 0.0–1.0 |
| num_ctx | Context window size (tokens). Larger = more memory | 2048–131072 |
| top_p | Nucleus sampling — filters low-probability tokens | 0.7–1.0 |
| top_k | Limits token selection pool | 20–100 |
| repeat_penalty | Penalizes repetition | 1.0–1.3 |
| num_gpu | GPU layers to use (0 = CPU only) | 0–99 |
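
If you maintain several specialized assistants, generating Modelfiles from code keeps their parameters consistent. A minimal sketch (the helper name is my own convenience, not part of Ollama):

```python
def make_modelfile(base: str, system: str, params: dict) -> str:
    """Render a Modelfile string from a base model, a system prompt,
    and a dict of PARAMETER settings."""
    lines = [f"FROM {base}", "", f'SYSTEM """{system}"""', ""]
    lines += [f"PARAMETER {k} {v}" for k, v in params.items()]
    return "\n".join(lines)

mf = make_modelfile(
    "llama3.2",
    "You are a terse technical writer. Output markdown only.",
    {"temperature": 0.3, "num_ctx": 8192},
)
print(mf)
```

Write the result to a file and build it with ollama create my-writer -f ./Modelfile.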

System prompt best practices

The system prompt is the most powerful customization. Short, specific prompts beat long generic ones:

# Good — specific role with clear constraints
SYSTEM "You are a terse technical writer. Output markdown only. No preamble."

# Bad — vague and generic
SYSTEM "You are a helpful AI assistant that helps users with various tasks."

Pre-loading conversation history

You can inject example conversations to prime the model's behavior:

FROM llama3.2

SYSTEM "You answer questions about SQL databases concisely."

MESSAGE user "How do I get the 5 most recent rows?"
MESSAGE assistant "SELECT * FROM table ORDER BY id DESC LIMIT 5;"

MESSAGE user "How do I delete duplicates?"
MESSAGE assistant "DELETE FROM t WHERE id NOT IN (SELECT MIN(id) FROM t GROUP BY col);"

---

8. Advanced Configuration

Increase context window

By default, most models use a 2048–4096 token context. For longer documents:

# Set context to 32K for this session (inside the REPL)
ollama run llama3.2
>>> /set parameter num_ctx 32768

Or set permanently in Modelfile

PARAMETER num_ctx 32768

Caution: Larger context uses more VRAM. Double-check with the VRAM calculator before increasing significantly.

Run Ollama on a specific IP/port

By default, Ollama only listens on 127.0.0.1:11434. To share it on your local network:

# macOS / Linux
OLLAMA_HOST=0.0.0.0:11434 ollama serve

Or set permanently via environment variable

export OLLAMA_HOST=0.0.0.0:11434

Security note: Never expose port 11434 to the internet without authentication. Anyone with access can run queries and consume your GPU.

Keep models loaded in memory

Ollama unloads models after 5 minutes by default. Override this:

# Keep loaded indefinitely
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'

Unload immediately after use

curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'

Or set globally

export OLLAMA_KEEP_ALIVE=10m
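
The keep_alive values follow a small grammar: -1 means keep the model loaded indefinitely, 0 means unload immediately, and strings like "10m" or "24h" are durations. If your tooling needs to reason about these values, a normalizer like this sketch helps (my own convenience function, not part of the Ollama API):

```python
def keep_alive_seconds(value) -> float:
    """Normalize Ollama-style keep_alive values to seconds:
    -1 -> infinity (never unload), 0 -> unload now, "10m"/"24h" -> duration."""
    units = {"s": 1, "m": 60, "h": 3600}
    if isinstance(value, (int, float)):
        return float("inf") if value < 0 else float(value)
    if value and value[-1] in units:
        return float(value[:-1]) * units[value[-1]]
    return float(value)  # bare number in a string, treated as seconds

print(keep_alive_seconds("10m"))   # 600.0
print(keep_alive_seconds(-1))      # inf
```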

Environment variables reference

| Variable | Default | What it does |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | IP and port to listen on |
| OLLAMA_MODELS | ~/.ollama/models | Where models are stored |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep models loaded |
| OLLAMA_NUM_PARALLEL | 1 | Parallel requests (needs extra VRAM) |
| OLLAMA_MAX_LOADED_MODELS | 1 | Max models in memory simultaneously |
| OLLAMA_FLASH_ATTENTION | 0 | Enable Flash Attention (reduces VRAM) |

Enable Flash Attention

Flash Attention reduces memory usage with minimal quality impact — recommended for most setups:

export OLLAMA_FLASH_ATTENTION=1
ollama serve

---

9. Troubleshooting Common Errors

"Error: model not found"

# Check available models
ollama list

Pull the missing model

ollama pull llama3.2

Check exact model name on ollama.com/library

ollama pull llama3.2:latest

Also check spelling — llama3.2 is correct, not llama-3.2 or llama_3_2.

VRAM out of memory (OOM)

Error: GPU out of memory. Requested X bytes but only Y bytes available
Solutions:
  • Use a smaller quantization: ollama pull llama3.3:70b-instruct-q2_K instead of the default Q4
  • Use a smaller model variant: llama3.2:1b instead of llama3.2:3b
  • Increase CPU offloading: Ollama will use CPU RAM for overflow automatically
  • Close other GPU-using apps: Games, video editing, other AI tools all compete for VRAM
  • Check your actual VRAM: nvidia-smi on Linux/Windows, Activity Monitor on Mac
# Check VRAM usage on NVIDIA
nvidia-smi

# Check on Mac
sudo powermetrics --samplers gpu_power -n 1

Use the VRAM calculator to find models that fit your GPU →

Slow inference (low tokens/second)

Expected speeds for reference:

| Setup | Tokens/second (7B Q4) |
|---|---|
| CPU only (modern) | 3–8 |
| RTX 3060 12GB | 25–40 |
| RTX 4070 | 50–70 |
| RTX 4090 | 100–130 |
| M3 Max | 60–90 |

If you're well below these numbers:

# Check if GPU is being used (look for GPU layers count)
ollama run llama3.2 --verbose

# Check if model is in VRAM or RAM
ollama ps

Common causes of unexpectedly slow inference: layers offloaded to CPU RAM because the model doesn't fit in VRAM, an oversized context window eating memory, or other applications competing for the GPU.

Ollama not starting

# Linux: check service status
systemctl status ollama
journalctl -u ollama -n 50

# macOS: start manually
ollama serve

# Check if port is in use
lsof -i :11434

Connection refused (API not responding)

# Start the Ollama server manually
ollama serve

# Test if it's running
curl http://localhost:11434

Should return: "Ollama is running"

If ollama serve throws an error, Ollama is probably already running in the background (system tray on macOS/Windows). Kill and restart it.

Model download stuck / corrupted

# Force re-download
ollama rm llama3.2
ollama pull llama3.2

# Check disk space — large models need 4–50GB free
df -h ~/.ollama

---

What to Try Next

You're set up and running. Ollama is the foundation; everything else builds on it.
