Ollama Complete Guide | Run LLMs Locally, API, Open WebUI


In a Nutshell

Ollama lets you run powerful open-source LLMs (Llama 3, Mistral, Gemma, Phi) on your own hardware — no API keys, no usage costs, full privacy. This guide covers everything from first install to production API integration.

What This Guide Covers

Ollama makes running open-source LLMs as simple as ollama run llama3. This guide covers model management, the REST API, integration with Python and Node.js, and running a local ChatGPT-like UI.

Real-world insight: Running Llama 3.1 8B locally with Ollama on an M2 Mac costs $0/month and delivers GPT-3.5-level quality for most coding and writing tasks.


Installation

# macOS (Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com

Start the server:

ollama serve
# Server runs on http://localhost:11434

On macOS, Ollama runs as a menu bar app automatically.
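Before wiring up any clients, it is worth confirming the server is reachable. A minimal health-check sketch using only the Python standard library (a GET on the root URL of a running server returns HTTP 200 with the body "Ollama is running"):

```python
import urllib.request
import urllib.error

def ollama_is_up(base_url="http://localhost:11434", timeout=2):
    """Return True if an Ollama server answers at base_url with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused / timeout: no server listening
        return False

print(ollama_is_up())
```

This prints True when `ollama serve` (or the menu bar app) is running, False otherwise.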


1. Running Models

# Download and run (interactive chat)
ollama run llama3.2

# Specific model size
ollama run llama3.2:3b    # 3 billion params (~2GB)
ollama run llama3.1:8b    # 8 billion params (~5GB)
ollama run llama3.1:70b   # 70 billion params (~40GB, needs high-end GPU)

# Other popular models
ollama run mistral         # Mistral 7B — fast, good for code
ollama run gemma3          # Google Gemma 3
ollama run phi4            # Microsoft Phi-4 — small but capable
ollama run qwen2.5-coder   # Best for code generation
ollama run deepseek-r1     # Reasoning model

Once the model downloads, you get an interactive chat. Type /bye to exit.


2. Model Management

# List downloaded models
ollama list

# Pull a model without running
ollama pull mistral

# Show model info
ollama show llama3.2

# Delete a model
ollama rm llama3.2:3b

# Copy/rename a model
ollama cp llama3.2 my-llama

# Check running models
ollama ps

3. REST API

Ollama exposes a REST API at http://localhost:11434. The native endpoints live under /api, and an OpenAI-compatible endpoint is available under /v1:

Generate (completion)

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Explain WebSockets in one paragraph.",
    "stream": false
  }'

Chat (multi-turn conversation)

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a senior software engineer."},
      {"role": "user", "content": "What is the difference between REST and GraphQL?"}
    ],
    "stream": false
  }'
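With "stream": true (the default), both endpoints return newline-delimited JSON: each line carries a text fragment in response (for /api/generate) or message.content (for /api/chat), and the final line has "done": true. A minimal parser sketch for such a stream, shown here on hand-written sample chunks rather than a live response:

```python
import json

def collect_stream(ndjson_lines):
    """Concatenate content fragments from an Ollama streaming response."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        # /api/chat puts text under message.content, /api/generate under response
        parts.append(chunk.get("message", {}).get("content", "") or chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

sample = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo"}, "done": true}',
]
print(collect_stream(sample))  # → Hello
```

In practice you would feed this the lines of an HTTP response body read incrementally.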

OpenAI-compatible endpoint

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

This endpoint is drop-in compatible with the OpenAI SDK.


4. Python Integration

With the ollama package

pip install ollama

import ollama

# Simple generation
response = ollama.generate(
    model='llama3.2',
    prompt='Write a Python function to parse JSON safely.'
)
print(response['response'])

# Chat with history
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a helpful coding assistant.'},
        {'role': 'user', 'content': 'How do I reverse a list in Python?'},
    ]
)
print(response['message']['content'])

# Streaming
for chunk in ollama.generate(model='llama3.2', prompt='Tell me a story', stream=True):
    print(chunk['response'], end='', flush=True)

With OpenAI SDK (drop-in replacement)

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)

This lets you switch between Ollama and OpenAI by changing only base_url and api_key.
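One way to exploit that compatibility is a small config switch. A sketch (the LLM_PROVIDER variable and helper name are illustrative, not part of any SDK):

```python
import os

def client_config(provider=None):
    """Return base_url/api_key kwargs for the OpenAI SDK, selecting
    local Ollama or the hosted OpenAI API via an environment variable."""
    provider = provider or os.environ.get("LLM_PROVIDER", "ollama")
    if provider == "ollama":
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
    }

# client = OpenAI(**client_config())  # rest of the code stays identical
```

Everything downstream of the client construction stays provider-agnostic.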


5. Node.js Integration

npm install ollama

import ollama from 'ollama';

// Generate
const response = await ollama.generate({
  model: 'llama3.2',
  prompt: 'Explain async/await in JavaScript.',
  stream: false,
});
console.log(response.response);

// Chat
const chat = await ollama.chat({
  model: 'llama3.2',
  messages: [
    { role: 'user', content: 'Write a TypeScript interface for a User object.' }
  ],
});
console.log(chat.message.content);

// Streaming
const stream = await ollama.generate({
  model: 'llama3.2',
  prompt: 'Write a blog post about TypeScript.',
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.response);
}

6. Custom Modelfiles

Create custom models with system prompts and parameters:

# Modelfile
FROM llama3.2

SYSTEM """
You are a senior TypeScript developer. Always provide type-safe code examples.
Respond concisely and include practical examples.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

# Build the custom model
ollama create typescript-expert -f Modelfile

# Run it
ollama run typescript-expert

Common parameters:

  • temperature — sampling randomness (lower = more deterministic, higher = more creative; Ollama's default is 0.8)
  • num_ctx — context window size (tokens)
  • top_p — nucleus sampling (0.0–1.0)
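The same parameters can also be set per request, without a Modelfile, via the options field of the native API. A minimal sketch of the request body (the payload shape follows the /api/generate endpoint; the helper function itself is just illustrative):

```python
def generate_body(model, prompt, **options):
    """Build a JSON body for POST /api/generate with sampling options."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}

body = generate_body("llama3.2", "Summarize WebSockets.", temperature=0.3, num_ctx=8192)
print(body["options"])  # → {'temperature': 0.3, 'num_ctx': 8192}
```

Request-level options override the model's Modelfile defaults for that call only.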

7. Open WebUI — Local ChatGPT UI

# Run with Docker (easiest)
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 — you get a full ChatGPT-like interface that connects to your local Ollama models.

Features: model switching, conversation history, file uploads, image understanding (with vision models).


8. LangChain Integration

from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOllama(model="llama3.2", temperature=0.3)

messages = [
    SystemMessage(content="You are a helpful Python expert."),
    HumanMessage(content="Show me how to use dataclasses in Python."),
]

response = llm.invoke(messages)
print(response.content)

# Streaming
for chunk in llm.stream(messages):
    print(chunk.content, end="", flush=True)

9. GPU Acceleration

NVIDIA GPU (Linux/Windows)

Ollama auto-detects CUDA if drivers are installed:

# Verify the GPU is being used
ollama run llama3.2
# Then, in another terminal, check the PROCESSOR column:
ollama ps
# "100% GPU" means the model is fully offloaded to the GPU;
# the `ollama serve` startup logs also list any GPUs detected

Apple Silicon (macOS)

Metal GPU acceleration is automatic on M1/M2/M3/M4 Macs — no configuration needed.

Check GPU usage

# macOS
sudo powermetrics --samplers gpu_power -i 1000

# Linux
nvidia-smi dmon -s u

Model Recommendations

Use case            Recommended model
General chat        llama3.2:3b (fast) or llama3.1:8b (smarter)
Code generation     qwen2.5-coder:7b or deepseek-coder:6.7b
Reasoning tasks     deepseek-r1:8b
Vision (images)     llava:7b or llama3.2-vision
Embeddings (RAG)    nomic-embed-text or mxbai-embed-large
Small / fast        phi4-mini:3.8b or gemma3:1b
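The embedding models above return vectors that are typically compared with cosine similarity, which is the core of RAG retrieval. A minimal sketch (the commented ollama.embeddings call assumes the Python package from section 4):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# With a running server, vectors would come from the embeddings endpoint, e.g.:
# vec = ollama.embeddings(model="nomic-embed-text", prompt="What is Ollama?")["embedding"]

print(cosine([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```

For retrieval, embed your documents once, embed each query, and return the documents with the highest cosine score.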

Key Takeaways

  • Zero cost after initial hardware — no per-token billing
  • Full privacy — data never leaves your machine
  • OpenAI-compatible API — swap between local and cloud easily
  • Custom Modelfiles — bake in system prompts and tune parameters
  • Open WebUI — instant ChatGPT-like UI in Docker

Ollama is the fastest way to get a local LLM running. Start with ollama run llama3.2, explore the REST API, then integrate with LangChain or the OpenAI SDK for production use cases.