Ollama
Local LLM in one command — the easiest start
Overview
Ollama is a developer tool built on top of llama.cpp that lets you run local LLMs with one command. It downloads models from its own registry, stores them in GGUF format, and exposes an OpenAI-compatible API. From install to first chat: 5 minutes.
Single-stream focused: lightning fast for one user at a time, but not designed for 50+ concurrent production requests. Auto-detects hardware: Metal on Mac, CUDA on NVIDIA, ROCm on AMD.
Installation
# Homebrew
brew install ollama
# or the .dmg installer:
# https://ollama.com/download/Ollama-darwin.zip
ollama serve &        # start the server in the background
ollama --version
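If the CLI responds, the HTTP server should too; a quick check in Python (assuming the default localhost:11434 address and the requests package installed):
import requests

# Assumes the default server address; adjust if OLLAMA_HOST is set.
r = requests.get("http://localhost:11434/api/version", timeout=5)
print(r.json())  # -> {"version": "..."}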
Configuration
Pull and run your first model. Unless you ask for a specific tag, Ollama pulls the default quantization (usually Q4_K_M).
# Pull a model (defaults to Q4_K_M)
ollama pull llama3.1:8b
# Or run it directly (downloads automatically if missing)
ollama run llama3.1:8b
> Hello, introduce yourself
... response ...
> /bye
# List installed models
ollama list
ollama show llama3.1:8b
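The same inventory is available over HTTP; a small sketch using requests (the /api/tags field names follow the current API docs; treat them as an assumption if you're on an older release):
import requests

# GET /api/tags returns the same data as `ollama list`.
models = requests.get("http://localhost:11434/api/tags", timeout=5).json()["models"]
for m in models:
    size_gb = m["size"] / 1e9  # size is reported in bytes
    quant = m.get("details", {}).get("quantization_level", "?")
    print(f"{m['name']:<30} {size_gb:5.1f} GB  {quant}")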
# Model store location (default: ~/.ollama/models)
export OLLAMA_MODELS=/Volumes/External/ollama-models
# Number of models kept loaded at the same time
export OLLAMA_MAX_LOADED_MODELS=2
# Concurrent requests
export OLLAMA_NUM_PARALLEL=4
# Context length (default is 2048!)
export OLLAMA_CONTEXT_LENGTH=8192
# Server address (for remote access)
export OLLAMA_HOST=0.0.0.0:11434
# GPU layers (auto-detected; set to override)
export OLLAMA_NUM_GPU=999  # all layers on the GPU
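These variables must be visible to the `ollama serve` process, not just to your shell or client. One way to script that from Python; the subprocess approach is only an illustration, a plain shell export before `ollama serve` works just as well:
import os
import subprocess

# Copy the current environment and add the Ollama overrides to it.
env = dict(
    os.environ,
    OLLAMA_CONTEXT_LENGTH="8192",
    OLLAMA_NUM_PARALLEL="4",
    OLLAMA_MAX_LOADED_MODELS="2",
)
# The variables only take effect for the server process they are passed to.
server = subprocess.Popen(["ollama", "serve"], env=env)
print("Ollama server started with pid", server.pid)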
# Modelfile
FROM llama3.1:8b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
SYSTEM """You are a Turkish-speaking coding assistant.
Give short, technical answers."""
# Build & run:
# ollama create mybot -f Modelfile
# ollama run mybot
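After `ollama create`, the custom model behaves like any other tag, and its PARAMETER defaults plus SYSTEM prompt apply automatically. A sketch that builds and queries it (model name mybot as above; uses the OpenAI-compatible endpoint described later):
import subprocess
from openai import OpenAI

# Build (or rebuild) the custom model from the Modelfile in the current directory.
subprocess.run(["ollama", "create", "mybot", "-f", "Modelfile"], check=True)

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="mybot",  # system prompt, temperature and num_ctx come from the Modelfile
    messages=[{"role": "user", "content": "Explain GGUF in two sentences."}],
)
print(resp.choices[0].message.content)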
Hardware acceleration
Ollama auto-detects hardware — no config needed. On Apple Silicon it uses the Metal GPU backend with unified memory; CUDA on NVIDIA, ROCm on AMD. CPU fallback works but is slow above 7B.
| Hardware | Support |
|---|---|
| Apple Silicon (M1/M2/M3/M4) | Metal (automatic), unified memory |
| NVIDIA GPU | CUDA 11.4+ (automatic), all RTX series |
| AMD GPU | ROCm (Linux), RDNA2/3 family |
| Intel Mac | CPU only — slow |
| CPU fallback | ✓ |
| Multi-GPU | Single node, no tensor parallelism |
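To check whether a loaded model actually landed on the GPU, `ollama ps` shows the split; the same data is exposed at /api/ps. A sketch, assuming a model is currently loaded and that the response fields (size, size_vram) match the current API docs:
import requests

# GET /api/ps lists loaded models and how many bytes of each sit in VRAM.
for m in requests.get("http://localhost:11434/api/ps", timeout=5).json().get("models", []):
    on_gpu = m["size_vram"] / m["size"] if m["size"] else 0.0
    print(f"{m['name']}: {on_gpu:.0%} of the model in VRAM")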
Model formats & quantization
Format: GGUF (the llama.cpp standard). Each model ships in multiple quantization variants; without an explicit choice, Ollama pulls Q4_K_M. The official registry has Llama, Mistral, Qwen, DeepSeek, Phi, Gemma, etc. You can also pull any GGUF from Hugging Face.
# Default (Q4_K_M)
ollama pull llama3.1:8b
# Better quality, more RAM
ollama pull llama3.1:8b-instruct-q5_K_M
ollama pull llama3.1:8b-instruct-q8_0
# Smaller (3-4 GB), acceptable quality loss
ollama pull llama3.1:8b-instruct-q3_K_M
# GGUF straight from Hugging Face
ollama pull hf.co/bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
| Quantization | Trade-off |
|---|---|
| Q2_K | Tiny — quality drops, only for experiments |
| Q3_K_M | Small, acceptable quality |
| Q4_K_M ★ | Sweet spot — default, balanced |
| Q5_K_M | Higher quality, 25% more VRAM |
| Q8_0 | Near-FP16 quality, 2× size |
| FP16 | Lossless but large (8B = 16 GB) |
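A rough way to read this table: file size is roughly parameters × effective bits-per-weight / 8. The bits-per-weight figures below are approximations, not values reported by Ollama:
# Rough GGUF size estimate: parameters * effective bits-per-weight / 8 (bpw values are approximate).
def approx_size_gb(params_billion: float, bpw: float) -> float:
    return params_billion * bpw / 8

for name, bpw in [("Q3_K_M", 4.0), ("Q4_K_M", 4.9), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"8B model @ {name}: ~{approx_size_gb(8, bpw):.1f} GB")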
API
Two APIs, both on port 11434: a native one (/api/chat, /api/generate) and an OpenAI-compatible one (/v1/chat/completions). Existing OpenAI clients work by just swapping base_url.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client library, but ignored by Ollama
)
resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role":"user","content":"Hi"}],
  "options": {
    "temperature": 0.3,
    "num_ctx": 8192,
    "num_predict": 256
  }
}'
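The same native endpoint from Python: the response streams as one JSON object per line until done is true, and per-request options override the defaults (including the 2048-token context). A sketch using requests:
import json
import requests

payload = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hi"}],
    # Per-request options; num_ctx here overrides the 2048-token default.
    "options": {"temperature": 0.3, "num_ctx": 8192, "num_predict": 256},
}
# The native API streams one JSON object per line until "done": true.
with requests.post("http://localhost:11434/api/chat", json=payload, stream=True) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()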
Performance
| Metric | Value |
|---|---|
| Single-stream tps (M2 Max, Llama-3.1-8B Q4) | ~50 tok/s |
| Single-stream tps (RTX 4090, Llama-3.1-8B Q4) | ~110 tok/s |
| Time-to-first-token (TTFT) | 100-300 ms (hot model) |
| Concurrent requests | Tunable via OLLAMA_NUM_PARALLEL, but well below vLLM |
| Cold model load | 5-30s (depends on size + disk) |
| RAM/VRAM efficiency | Medium — KV cache not paged |
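You can measure single-stream throughput yourself: the native /api/generate response includes eval_count (generated tokens) and eval_duration (nanoseconds). A sketch; results depend entirely on your hardware and quantization:
import requests

# A non-streaming generate call; the final response includes timing counters.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Write a haiku about caching.", "stream": False},
    timeout=300,
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds.
print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tok/s")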
Common pitfalls
- Default context is 2048: without OLLAMA_CONTEXT_LENGTH or PARAMETER num_ctx in your Modelfile, you're stuck at 2048 tokens. Empty long-context responses usually trace back to this.
- Not designed for production: no continuous batching, no paged attention. Latency explodes past ~10 concurrent users. Use vLLM for production serving.
- The default quant can miss: the default pull (usually Q4_K_M) is a one-size-fits-all choice and can over-compress smaller models. For real work, manually pull Q5_K_M or Q8_0.
- Remote access is insecure by default: setting OLLAMA_HOST=0.0.0.0 exposes port 11434 to the world with no auth. Front it with a reverse proxy + auth (Caddy, nginx, Tailscale).