Running AI locally means running the model on your own computer — your CPU and GPU do the work, your data never leaves your machine, and there are no API bills. This guide gets you from zero to chatting with a local LLM in about 5 minutes. No experience required. If you can open a terminal and run a command, you can do this.
---
Why Run LLMs Locally?
Before you dive in, here's why it's worth the 5-minute setup:
- Privacy — Your prompts and responses stay on your hardware. Nothing goes to OpenAI, Anthropic, or anyone else.
- No API costs — After the one-time model download, every query is free. Run it 10,000 times today and pay $0.
- Works offline — On a plane, in a basement, in a data-sensitive environment. No internet required after setup.
- No rate limits — Need to process 5,000 documents? Go ahead. No throttling, no waiting.
- Full control — Customize system prompts, context length, and model behavior without restrictions.
What You Need
Local LLMs run on a wide range of hardware. Here's what matters:
| Hardware | What It Enables |
|---|---|
| Any modern CPU (no GPU) | Small models (1B–3B params). Slow but functional. |
| 8GB GPU VRAM | 7B parameter models at full speed. Best value entry point. |
| 16GB GPU VRAM | 13B models. Near-GPT-3.5 quality locally. |
| 24GB+ GPU VRAM | 33B+ models. Near-GPT-4 quality. |
| Apple Silicon (M1/M2/M3) | Uses unified memory. 16GB M2 runs 13B models well. |
If you're looking to upgrade your GPU for local AI, our GPU buying guide breaks down the best options for every budget.
Minimum to get started: any computer with 8GB RAM. Expect slow inference on CPU-only, but it works.
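Not sure what you're working with? Here's a quick check, assuming an NVIDIA card with drivers installed (the second command is for Apple Silicon Macs, where unified memory is what counts):
# NVIDIA: print GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
# Apple Silicon: print total unified memory in bytes
sysctl -n hw.memsize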
---
Step 1: Choose Your Runner
A runner is the software that loads and runs LLMs on your hardware. You need one. Here's the quick decision tree:
Are you comfortable with a terminal?
- Yes → Use Ollama. It's fast, has the largest model library, and runs a REST API out of the box. Best for developers and anyone comfortable with command lines.
- No → Use LM Studio. It's a polished desktop GUI — download, click, chat. No terminal required.
- Privacy is your #1 priority → Use GPT4All. Everything is offline by design.
This guide uses Ollama for the hands-on steps — it's the most popular choice and the easiest to automate.
---
Step 2: Install Ollama
Ollama is a single-binary installer. It handles model downloads, GPU detection, and running the inference server automatically.
macOS
brew install ollama
Or download the Mac app directly from ollama.com.
Linux
curl -fsSL https://ollama.ai/install.sh | sh
This installs Ollama as a system service. It auto-detects NVIDIA and AMD GPUs.
Windows
Download the installer from ollama.com and run it. Works natively on Windows 10/11, no WSL required.
Whichever platform you're on, verify it installed correctly:
ollama --version
You should see something like ollama version 0.6.x. If so, you're ready.
---
Step 3: Download Your First Model
Now you'll pull a model. This downloads the model weights to your local machine (usually stored in ~/.ollama/models).
Recommended for Beginners
If you have 8GB+ VRAM (or 16GB+ RAM for CPU):
ollama pull llama3.1:8b
Llama 3.1 8B is Meta's flagship open-source model. It's fast, capable, and fits on almost any modern GPU. Great general-purpose model for chat, writing, coding, and Q&A.
If you have less than 8GB VRAM:
ollama pull gemma3:4b
Google's Gemma 3 4B is surprisingly capable for its size. Runs on 4GB VRAM or even CPU-only. Good starting point for lower-end hardware.
If you want a strong coding model:
ollama pull qwen2.5-coder:7b
Qwen 2.5 Coder 7B is one of the best small coding models available. Excellent for code generation, debugging, and explanation.
Download size: 7B models at Q4 quantization are about 4–5 GB. The download takes a few minutes depending on your connection speed. Ollama shows a progress bar.
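Once a pull finishes, you can confirm exactly what you downloaded; ollama show prints the model's details, including parameter count, context length, and quantization:
ollama show llama3.1:8b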
---
Step 4: Chat With Your Model
Option A: Interactive Terminal Chat
Once the download completes, start chatting immediately:
ollama run llama3.1:8b
You'll see a >>> prompt. Type your message and press Enter.
>>> Explain quantum entanglement in simple terms
Quantum entanglement is when two particles become linked...
>>> /bye
Type /bye to exit, or press Ctrl+D.
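You don't have to use the interactive prompt at all. ollama run also works in one-shot mode, which is handy for scripting (notes.txt below is just a placeholder filename):
# One-shot: pass the prompt as an argument; the reply prints and Ollama exits
ollama run llama3.1:8b "Give me three names for a coffee shop"
# Pipe a file in as extra context for the prompt
cat notes.txt | ollama run llama3.1:8b "Summarize this document in three bullet points"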
Option B: Web UI (Recommended for Most Users)
The terminal chat is fine for testing, but for daily use you want a proper chat interface. Open WebUI is the most popular option — it looks and feels like ChatGPT but runs entirely locally.
Install it via Docker (fastest method):
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 in your browser. It auto-connects to Ollama.
Option C: API (For Developers)
Ollama starts a REST API server at localhost:11434. You can query it from any code:
# Start the server (if not already running)
ollama serve
# Query via curl
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": false}'
Ollama also exposes an OpenAI-compatible endpoint at /v1, so you can drop in any LLM library or client that supports OpenAI's API format.
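For example, here's the same question sent to the OpenAI-style chat completions route; the /v1 path is the compatibility layer, and any OpenAI client pointed at http://localhost:11434/v1 (with a dummy API key) will talk to your local model:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'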
---
Step 5: What's Next
You have a working local LLM. Here's where to go from here.
Try a bigger model. If you have 16GB+ VRAM, step up to a 13–14B model. The quality jump is significant:
ollama pull phi4:14b # Microsoft's Phi-4 — excellent reasoning
ollama pull deepseek-r1:14b # Best for math and logic
Experiment with quantization. The default Q4 quantization is a good balance of quality and size. If you have spare VRAM, try Q8 for better output:
ollama pull llama3.1:8b-instruct-q8_0
List your models:
ollama list
Remove a model you're done with:
ollama rm gemma3:4b
Check if your GPU is being used:
# Linux/Windows
nvidia-smi
# Mac
sudo powermetrics --samplers gpu_power -i 1000 | grep -i "GPU"
If GPU usage spikes when you send prompts, the model is running on GPU (fast). If your CPU maxes out instead, you're running CPU-only (slower but still works).
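Ollama also has a built-in check: ollama ps lists the models currently loaded and shows whether each one is running on the GPU, on the CPU, or split across both.
ollama ps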
---
Quick Reference
| Task | Command |
|---|---|
| Install Ollama (Linux) | curl -fsSL https://ollama.ai/install.sh \| sh |
| Download a model | ollama pull llama3.1:8b |
| Chat interactively | ollama run llama3.1:8b |
| Start API server | ollama serve |
| List models | ollama list |
| Remove a model | ollama rm modelname |
| Check version | ollama --version |
FAQ
Do I need an internet connection to run LLMs locally?
Only for the initial model download. Once downloaded, models run entirely offline. No internet connection required for inference.
Can I run LLMs locally without a GPU?
Yes. Ollama falls back to CPU if no compatible GPU is found. It's significantly slower (expect 2–5 tokens/second vs. 30–80+ on a GPU), but works fine for light use. Small models like Gemma 3 4B are most usable on CPU.
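To measure your actual speed, run a prompt with the --verbose flag; Ollama prints timing statistics after the response, and the eval rate line is your tokens per second:
ollama run gemma3:4b --verbose "Write a haiku about winter"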
How much disk space do models take?
A 7B model at Q4 quantization uses about 4–5 GB. A 13B model takes 7–9 GB. A 70B model at Q4 takes around 40 GB. Plan your storage accordingly — SSD is strongly recommended for fast load times.
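To see how much space your downloaded models are actually using (assuming the default store location mentioned earlier):
du -sh ~/.ollama/models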
Are local LLMs as good as ChatGPT?
On some tasks, yes. Smaller models (7B) are noticeably weaker than GPT-4 on complex reasoning, but competitive on code generation, writing assistance, and Q&A. Models like Llama 3.3 70B and DeepSeek R1 32B match or exceed GPT-3.5 on many benchmarks.
What's the difference between Ollama, LM Studio, and GPT4All?
Ollama is CLI-first and developer-focused with an API server. LM Studio is a polished GUI app for desktop users. GPT4All prioritizes privacy and offline use. All run the same underlying models. See our full comparison guide for details.
Can I run local LLMs on a Mac?
Yes — Apple Silicon Macs (M1, M2, M3, M4) are excellent for local LLMs. They use unified memory, so a 16GB M2 MacBook can run 13B models well. Ollama has native Metal support.
Is running LLMs locally free?
After the one-time model download, inference is completely free. No API keys, no subscription, no per-token charges. The only "cost" is your electricity.