Ollama
Local LLM in one command — the easiest start
Overview
Ollama is a developer tool built on top of llama.cpp that lets you run local LLMs with one command. It downloads models from its own registry, stores them in GGUF format, and exposes an OpenAI-compatible API. From install to first chat: 5 minutes.
Single-stream focused: lightning fast for one user at a time, but not designed for 50+ concurrent production requests. Auto-detects hardware: Metal on Mac, CUDA on NVIDIA, ROCm on AMD.
Installation
# Homebrew
brew install ollama
# or the .dmg installer:
# https://ollama.com/download/Ollama-darwin.zip
ollama serve &        # start the server in the background
ollama --version
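If the CLI responds, the HTTP server should too; a quick check in Python (assuming the default localhost:11434 address and the requests package installed):
import requests

# Assumes the default server address; adjust if OLLAMA_HOST is set.
r = requests.get("http://localhost:11434/api/version", timeout=5)
print(r.json())  # -> {"version": "..."}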
Configuration
Pull and run your first model. Unless you ask for a specific tag, Ollama pulls the default quantization (usually Q4_K_M).
# Pull a model (defaults to Q4_K_M)
ollama pull llama3.1:8b
# Or run it directly (downloads automatically if missing)
ollama run llama3.1:8b
> Hello, introduce yourself
... response ...
> /bye
# List installed models
ollama list
ollama show llama3.1:8b
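The same inventory is available over HTTP; a small sketch using requests (the /api/tags field names follow the current API docs; treat them as an assumption if you're on an older release):
import requests

# GET /api/tags returns the same data as `ollama list`.
models = requests.get("http://localhost:11434/api/tags", timeout=5).json()["models"]
for m in models:
    size_gb = m["size"] / 1e9  # size is reported in bytes
    quant = m.get("details", {}).get("quantization_level", "?")
    print(f"{m['name']:<30} {size_gb:5.1f} GB  {quant}")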
# Model store location (default: ~/.ollama/models)
export OLLAMA_MODELS=/Volumes/External/ollama-models
# Number of models kept loaded at the same time
export OLLAMA_MAX_LOADED_MODELS=2
# Concurrent requests
export OLLAMA_NUM_PARALLEL=4
# Context length (default is 2048!)
export OLLAMA_CONTEXT_LENGTH=8192
# Server address (for remote access)
export OLLAMA_HOST=0.0.0.0:11434
# GPU layers (auto-detected; set to override)
export OLLAMA_NUM_GPU=999  # all layers on the GPU
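These variables must be visible to the `ollama serve` process, not just to your shell or client. One way to script that from Python; the subprocess approach is only an illustration, a plain shell export before `ollama serve` works just as well:
import os
import subprocess

# Copy the current environment and add the Ollama overrides to it.
env = dict(
    os.environ,
    OLLAMA_CONTEXT_LENGTH="8192",
    OLLAMA_NUM_PARALLEL="4",
    OLLAMA_MAX_LOADED_MODELS="2",
)
# The variables only take effect for the server process they are passed to.
server = subprocess.Popen(["ollama", "serve"], env=env)
print("Ollama server started with pid", server.pid)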
# Modelfile
FROM llama3.1:8b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
SYSTEM """You are a Turkish-speaking coding assistant.
Give short, technical answers."""
# Build & run:
# ollama create mybot -f Modelfile
# ollama run mybot
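After `ollama create`, the custom model behaves like any other tag, and its PARAMETER defaults plus SYSTEM prompt apply automatically. A sketch that builds and queries it (model name mybot as above; uses the OpenAI-compatible endpoint described later):
import subprocess
from openai import OpenAI

# Build (or rebuild) the custom model from the Modelfile in the current directory.
subprocess.run(["ollama", "create", "mybot", "-f", "Modelfile"], check=True)

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="mybot",  # system prompt, temperature and num_ctx come from the Modelfile
    messages=[{"role": "user", "content": "Explain GGUF in two sentences."}],
)
print(resp.choices[0].message.content)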
Hardware acceleration
Ollama auto-detects hardware — no config needed. On Apple Silicon it uses the Metal GPU backend with unified memory; CUDA on NVIDIA, ROCm on AMD. CPU fallback works but is slow above 7B.
| Hardware | Support |
|---|---|
| Apple Silicon (M1/M2/M3/M4) | Metal (automatic), unified memory |
| NVIDIA GPU | CUDA 11.4+ (automatic), all RTX series |
| AMD GPU | ROCm (Linux), RDNA2/3 family |
| Intel Mac | CPU only — slow |
| CPU fallback | ✓ |
| Multi-GPU | Single node, no tensor parallelism |
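To check whether a loaded model actually landed on the GPU, `ollama ps` shows the split; the same data is exposed at /api/ps. A sketch, assuming a model is currently loaded and that the response fields (size, size_vram) match the current API docs:
import requests

# GET /api/ps lists loaded models and how many bytes of each sit in VRAM.
for m in requests.get("http://localhost:11434/api/ps", timeout=5).json().get("models", []):
    on_gpu = m["size_vram"] / m["size"] if m["size"] else 0.0
    print(f"{m['name']}: {on_gpu:.0%} of the model in VRAM")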
Model formats & quantization
Format: GGUF (the llama.cpp standard). Each model ships in multiple quantization variants; without an explicit choice, Ollama pulls Q4_K_M. The official registry has Llama, Mistral, Qwen, DeepSeek, Phi, Gemma, etc. You can also pull any GGUF from Hugging Face.
# Default (Q4_K_M)
ollama pull llama3.1:8b
# Better quality, more RAM
ollama pull llama3.1:8b-instruct-q5_K_M
ollama pull llama3.1:8b-instruct-q8_0
# Smaller (3-4 GB), acceptable quality loss
ollama pull llama3.1:8b-instruct-q3_K_M
# GGUF straight from Hugging Face
ollama pull hf.co/bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
| Quantization | Trade-off |
|---|---|
| Q2_K | Tiny — quality drops, only for experiments |
| Q3_K_M | Small, acceptable quality |
| Q4_K_M ★ | Sweet spot — default, balanced |
| Q5_K_M | Higher quality, 25% more VRAM |
| Q8_0 | Near-FP16 quality, 2× size |
| FP16 | Lossless but large (8B = 16 GB) |
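A rough way to read this table: file size is roughly parameters × effective bits-per-weight / 8. The bits-per-weight figures below are approximations, not values reported by Ollama:
# Rough GGUF size estimate: parameters * effective bits-per-weight / 8 (bpw values are approximate).
def approx_size_gb(params_billion: float, bpw: float) -> float:
    return params_billion * bpw / 8

for name, bpw in [("Q3_K_M", 4.0), ("Q4_K_M", 4.9), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"8B model @ {name}: ~{approx_size_gb(8, bpw):.1f} GB")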
API
Two APIs, both on port 11434: a native one (/api/chat, /api/generate) and an OpenAI-compatible one (/v1/chat/completions). Existing OpenAI clients work by just swapping base_url.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client library, but ignored by Ollama
)
resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role":"user","content":"Hi"}],
  "options": {
    "temperature": 0.3,
    "num_ctx": 8192,
    "num_predict": 256
  }
}'
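The same native endpoint from Python: the response streams as one JSON object per line until done is true, and per-request options override the defaults (including the 2048-token context). A sketch using requests:
import json
import requests

payload = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hi"}],
    # Per-request options; num_ctx here overrides the 2048-token default.
    "options": {"temperature": 0.3, "num_ctx": 8192, "num_predict": 256},
}
# The native API streams one JSON object per line until "done": true.
with requests.post("http://localhost:11434/api/chat", json=payload, stream=True) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()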
Performance
| Metric | Value |
|---|---|
| Single-stream tps (M2 Max, Llama-3.1-8B Q4) | ~50 tok/s |
| Single-stream tps (RTX 4090, Llama-3.1-8B Q4) | ~110 tok/s |
| Time-to-first-token (TTFT) | 100-300 ms (hot model) |
| Concurrent requests | Tunable via OLLAMA_NUM_PARALLEL, but well below vLLM |
| Cold model load | 5-30s (depends on size + disk) |
| RAM/VRAM efficiency | Medium — KV cache not paged |
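You can measure single-stream throughput yourself: the native /api/generate response includes eval_count (generated tokens) and eval_duration (nanoseconds). A sketch; results depend entirely on your hardware and quantization:
import requests

# A non-streaming generate call; the final response includes timing counters.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Write a haiku about caching.", "stream": False},
    timeout=300,
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds.
print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tok/s")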
Common pitfalls
- Default context is 2048: without OLLAMA_CONTEXT_LENGTH or PARAMETER num_ctx in your Modelfile, you're stuck at 2048 tokens. Empty long-context responses usually trace back to this.
- Not designed for production: no continuous batching, no paged attention. Latency explodes past ~10 concurrent users. Use vLLM for production serving.
- The default quant can miss: the default pull (usually Q4_K_M) is a one-size-fits-all choice and can over-compress smaller models. For real work, manually pull Q5_K_M or Q8_0.
- Remote access is insecure by default: setting OLLAMA_HOST=0.0.0.0 exposes port 11434 to the world with no auth. Front it with a reverse proxy + auth (Caddy, nginx, Tailscale).