Running AI locally means running the model on your own computer — your CPU and GPU do the work, your data never leaves your machine, and there are no API bills. This guide gets you from zero to chatting with a local LLM in about 5 minutes. No experience required. If you can open a terminal and run a command, you can do this.
---
Why Run LLMs Locally?
Before you dive in, here's why it's worth the 5-minute setup:
- Privacy — Your prompts and responses stay on your hardware. Nothing goes to OpenAI, Anthropic, or anyone else.
- No API costs — After the one-time model download, every query is free. Run it 10,000 times today and pay $0.
- Works offline — On a plane, in a basement, in a data-sensitive environment. No internet required after setup.
- No rate limits — Need to process 5,000 documents? Go ahead. No throttling, no waiting.
- Full control — Customize system prompts, context length, and model behavior without restrictions.
What You Need
Local LLMs run on a wide range of hardware. Here's what matters:
| Hardware | What It Enables |
|---|---|
| Any modern CPU (no GPU) | Small models (1B–3B params). Slow but functional. |
| 8GB GPU VRAM | 7B parameter models at full speed. Best value entry point. |
| 16GB GPU VRAM | 13B models. Near-GPT-3.5 quality locally. |
| 24GB+ GPU VRAM | 33B+ models. Near-GPT-4 quality. |
| Apple Silicon (M1/M2/M3) | Uses unified memory. 16GB M2 runs 13B models well. |
If you're looking to upgrade your GPU for local AI, our GPU buying guide breaks down the best options for every budget.
Minimum to get started: any computer with 8GB RAM. Expect slow inference on CPU-only, but it works.
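Not sure what you're working with? Here's a quick check, assuming an NVIDIA card with drivers installed (the second command is for Apple Silicon Macs, where unified memory is what counts):
# NVIDIA: print GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
# Apple Silicon: print total unified memory in bytes
sysctl -n hw.memsize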
---
Step 1: Choose Your Runner
A runner is the software that loads and runs LLMs on your hardware. You need one. Here's the quick decision tree:
Are you comfortable with a terminal?
- Yes → Use Ollama. It's fast, has the largest model library, and runs a REST API out of the box. Best for developers and anyone comfortable with command lines.
- No → Use LM Studio. It's a polished desktop GUI — download, click, chat. No terminal required.
- Privacy is your #1 priority → Use GPT4All. Everything is offline by design.
This guide uses Ollama for the hands-on steps — it's the most popular choice and the easiest to automate.
---
Step 2: Install Ollama
Ollama is a single-binary installer. It handles model downloads, GPU detection, and running the inference server automatically.
macOS
brew install ollama
Or download the Mac app directly from ollama.com.
Linux
curl -fsSL https://ollama.ai/install.sh | sh
This installs Ollama as a system service. It auto-detects NVIDIA and AMD GPUs.
Windows
Download the installer from ollama.com and run it. Works natively on Windows 10/11, no WSL required.
Whichever platform you're on, verify it installed correctly:
ollama --version
You should see something like ollama version 0.6.x. If so, you're ready.
---
Step 3: Download Your First Model
Now you'll pull a model. This downloads the model weights to your local machine (usually stored in ~/.ollama/models).
Recommended for Beginners
If you have 8GB+ VRAM (or 16GB+ RAM for CPU):
ollama pull llama3.1:8b
Llama 3.1 8B is Meta's flagship open-source model. It's fast, capable, and fits on almost any modern GPU. Great general-purpose model for chat, writing, coding, and Q&A.
If you have less than 8GB VRAM:
ollama pull gemma3:4b
Google's Gemma 3 4B is surprisingly capable for its size. Runs on 4GB VRAM or even CPU-only. Good starting point for lower-end hardware.
If you want a strong coding model:
ollama pull qwen2.5-coder:7b
Qwen 2.5 Coder 7B is one of the best small coding models available. Excellent for code generation, debugging, and explanation.
Download size: 7B models at Q4 quantization are about 4–5 GB. The download takes a few minutes depending on your connection speed. Ollama shows a progress bar.
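Once a pull finishes, you can confirm exactly what you downloaded; ollama show prints the model's details, including parameter count, context length, and quantization:
ollama show llama3.1:8b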
---
Step 4: Chat With Your Model
Option A: Interactive Terminal Chat
Once the download completes, start chatting immediately:
ollama run llama3.1:8b
You'll see a >>> prompt. Type your message and press Enter.
>>> Explain quantum entanglement in simple terms
Quantum entanglement is when two particles become linked...
>>> /bye
Type /bye to exit, or press Ctrl+D.
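You don't have to use the interactive prompt at all. ollama run also works in one-shot mode, which is handy for scripting (notes.txt below is just a placeholder filename):
# One-shot: pass the prompt as an argument; the reply prints and Ollama exits
ollama run llama3.1:8b "Give me three names for a coffee shop"
# Pipe a file in as extra context for the prompt
cat notes.txt | ollama run llama3.1:8b "Summarize this document in three bullet points"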
Option B: Web UI (Recommended for Most Users)
The terminal chat is fine for testing, but for daily use you want a proper chat interface. Open WebUI is the most popular option — it looks and feels like ChatGPT but runs entirely locally.
Install it via Docker (fastest method):
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 in your browser. It auto-connects to Ollama.
Option C: API (For Developers)
Ollama starts a REST API server at localhost:11434. You can query it from any code:
# Start the server (if not already running)
ollama serve
# Query via curl
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": false}'
Ollama also exposes an OpenAI-compatible endpoint at /v1, so you can drop in any LLM library or client that supports OpenAI's API format.
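For example, here's the same question sent to the OpenAI-style chat completions route; the /v1 path is the compatibility layer, and any OpenAI client pointed at http://localhost:11434/v1 (with a dummy API key) will talk to your local model:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'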
---
Step 5: What's Next
You have a working local LLM. Here's where to go from here.
Try a bigger model. If you have 16GB+ VRAM, step up to a 13–14B model. The quality jump is significant:
ollama pull phi4:14b # Microsoft's Phi-4 — excellent reasoning
ollama pull deepseek-r1:14b # Best for math and logic
Experiment with quantization. The default Q4 quantization is a good balance of quality and size. If you have spare VRAM, try Q8 for better output:
ollama pull llama3.1:8b-instruct-q8_0
List your models:
ollama list
Remove a model you're done with:
ollama rm gemma3:4b
Check if your GPU is being used:
# Linux/Windows
nvidia-smi
# Mac
sudo powermetrics --samplers gpu_power -i 1000 | grep -i "GPU"
If GPU usage spikes when you send prompts, the model is running on GPU (fast). If your CPU maxes out instead, you're running CPU-only (slower but still works).
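Ollama also has a built-in check: ollama ps lists the models currently loaded and shows whether each one is running on the GPU, on the CPU, or split across both.
ollama ps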
---
Quick Reference
| Task | Command |
|---|---|
| Install Ollama (Linux) | curl -fsSL https://ollama.ai/install.sh \| sh |
| Download a model | ollama pull llama3.1:8b |
| Chat interactively | ollama run llama3.1:8b |
| Start API server | ollama serve |
| List models | ollama list |
| Remove a model | ollama rm modelname |
| Check version | ollama --version |
FAQ
Do I need an internet connection to run LLMs locally?
Only for the initial model download. Once downloaded, models run entirely offline. No internet connection required for inference.
Can I run LLMs locally without a GPU?
Yes. Ollama falls back to CPU if no compatible GPU is found. It's significantly slower (expect 2–5 tokens/second vs. 30–80+ on a GPU), but works fine for light use. Small models like Gemma 3 4B are most usable on CPU.
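To measure your actual speed, run a prompt with the --verbose flag; Ollama prints timing statistics after the response, and the eval rate line is your tokens per second:
ollama run gemma3:4b --verbose "Write a haiku about winter"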
How much disk space do models take?
A 7B model at Q4 quantization uses about 4–5 GB. A 13B model takes 7–9 GB. A 70B model at Q4 takes around 40 GB. Plan your storage accordingly — SSD is strongly recommended for fast load times.
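To see how much space your downloaded models are actually using (assuming the default store location mentioned earlier):
du -sh ~/.ollama/models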
Are local LLMs as good as ChatGPT?
On some tasks, yes. Smaller models (7B) are noticeably weaker than GPT-4 on complex reasoning, but competitive on code generation, writing assistance, and Q&A. Models like Llama 3.3 70B and DeepSeek R1 32B match or exceed GPT-3.5 on many benchmarks.
What's the difference between Ollama, LM Studio, and GPT4All?
Ollama is CLI-first and developer-focused with an API server. LM Studio is a polished GUI app for desktop users. GPT4All prioritizes privacy and offline use. All run the same underlying models. See our full comparison guide for details.
Can I run local LLMs on a Mac?
Yes — Apple Silicon Macs (M1, M2, M3, M4) are excellent for local LLMs. They use unified memory, so a 16GB M2 MacBook can run 13B models well. Ollama has native Metal support.
Is running LLMs locally free?
After the one-time model download, inference is completely free. No API keys, no subscription, no per-token charges. The only "cost" is your electricity.