llama.cpp
C++ inference engine that runs anywhere
Overview
llama.cpp is a pure C/C++ inference engine created by Georgi Gerganov. No Python dependency, self-contained binaries, runs anywhere. Tools like Ollama and LM Studio are built on top of it, but you can use it directly for more control.
The project originated the GGUF format and supports nearly any hardware: CPU, Metal, CUDA, Vulkan, ROCm, OpenCL, SYCL. Ideal for embedded use, server deployment, or shipping inside your own product.
Installation
# Homebrew (Metal enabled automatically)
brew install llama.cpp
# Or build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
./build/bin/llama-cli --version
Configuration
# Download a GGUF from Hugging Face
huggingface-cli download bartowski/Llama-3.1-8B-Instruct-GGUF \
Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
# Chat
./build/bin/llama-cli \
-m ./models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-p "Selam, kendini tanıt." \
-n 256 \
-c 8192 \
-ngl 99 # GPU layers (for Metal/CUDA)

# Server mode
./build/bin/llama-server \
-m ./models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--host 127.0.0.1 --port 8080 \
-c 8192 \
-ngl 99 \
--parallel 4 \
--cont-batching
# OpenAI-compatible endpoint:
# http://localhost:8080/v1/chat/completions
# Web UI: http://localhost:8080/

# Key flags
-m, --model # path to the GGUF file
-c, --ctx-size # context length (32K-128K if the model supports it)
-n, --n-predict # max tokens in the response (-1 = unlimited)
-ngl N # number of layers to offload to the GPU (-ngl 99 = all)
-t, --threads # number of CPU threads
-tb, --threads-batch # threads for batch processing
--temp 0.3 # temperature
--top-k 40
--top-p 0.9
--repeat-penalty 1.1
--mlock # pin the model in RAM (prevent swapping)
--no-mmap # disable memory-mapped loading (mmap is the default)
--cont-batching # continuous batching (server)
--parallel N # concurrent slots (server)
--flash-attn # FlashAttention (CUDA, fast)
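The flags combine naturally into a single tuned invocation. A quick sketch using the model downloaded earlier; the thread count and sampling values are illustrative, not recommendations.

# Example: tuned interactive run combining the flags above
./build/bin/llama-cli \
-m ./models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-c 8192 -n 512 -ngl 99 -t 8 \
--temp 0.3 --top-k 40 --top-p 0.9 --repeat-penalty 1.1 \
--mlock \
-p "Summarize what the GGUF format is in two sentences."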
Hardware acceleration
llama.cpp has the broadest hardware support: Metal (Apple), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-vendor), SYCL (Intel), OpenCL, and even NEON SIMD on a Raspberry Pi. Build commands for the non-Metal backends are sketched after the table below.
| Hardware | Backend |
| Apple Silicon (M1/M2/M3/M4) | Metal (default; unified memory) |
| NVIDIA GPU | CUDA + cuBLAS, FlashAttention 2 |
| AMD GPU | ROCm (Linux) or Vulkan (cross-OS) |
| Intel GPU | SYCL (oneAPI) |
| Vulkan | Vendor-agnostic (Linux/Win) |
| CPU only (AVX2/NEON) | ✓ |
| Embedded (RPi, ARM) | ✓ |
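Each row above maps to a build-time flag. A minimal sketch for the non-Metal backends, assuming the current GGML_* CMake option names (older trees used LLAMA_* names); check the build docs for your checkout.

# NVIDIA: CUDA backend
cmake -B build -DGGML_CUDA=ON
# Cross-vendor (AMD/Intel/NVIDIA): Vulkan backend
cmake -B build -DGGML_VULKAN=ON
# Then build as usual
cmake --build build --config Release -j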
Model formats & quantization
GGUF is llama.cpp's own model format. Users like bartowski and TheBloke on Hugging Face publish GGUF variants of every popular model, each in many quantization levels; recent builds can also pull them directly, as sketched after the table below.
| Quant | Quality / size |
| F16 / BF16 | Lossless, largest |
| Q8_0 | Near-FP16 quality |
| Q6_K | Very good, mid size |
| Q5_K_M / Q5_K_S | Good quality, reasonable VRAM |
| Q4_K_M ★ | Recommended — quality/size sweet spot |
| Q3_K_M | Small, acceptable quality |
| Q2_K / IQ2_XS | Tiny — experimental only |
| IQ4_XS / IQ3_S | Newer 'i-quants'; better quality at low bit widths |
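Instead of downloading manually, recent builds can fetch a quant straight from Hugging Face. A sketch assuming the -hf flag and the repo:quant suffix supported in newer releases; older builds may not have it.

# Pull and serve a specific quant directly from Hugging Face
./build/bin/llama-server -hf bartowski/Llama-3.1-8B-Instruct-GGUF:Q4_K_M -ngl 99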
# HF safetensors → GGUF (FP16)
python3 convert_hf_to_gguf.py /path/to/model \
--outfile model-f16.gguf
# Quantize (Q4_K_M)
./build/bin/llama-quantize \
model-f16.gguf model-Q4_K_M.gguf Q4_K_M
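Before deleting the FP16 file it's worth a quick sanity check of the quantized output. A minimal sketch; eval.txt is a placeholder for any plain-text file you want to measure perplexity on.

# Smoke-test the quantized model
./build/bin/llama-cli -m model-Q4_K_M.gguf -ngl 99 -n 32 -p "Hello"
# Optional: perplexity check against a text file of your choosing
./build/bin/llama-perplexity -m model-Q4_K_M.gguf -f eval.txt -ngl 99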
API
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")
resp = client.chat.completions.create(
model="local",
messages=[{"role": "user", "content": "Hi"}],
)
print(resp.choices[0].message.content)

curl http://localhost:8080/completion -d '{
"prompt": "Selam",
"n_predict": 128,
"temperature": 0.3,
"top_k": 40,
"top_p": 0.9
}'
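The OpenAI-compatible endpoint also streams. A sketch assuming it behaves like the OpenAI API with "stream": true (tokens arrive as server-sent events).

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "local",
  "messages": [{"role": "user", "content": "Hi"}],
  "stream": true
}'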
Performance
| Scenario | Approximate speed |
| Single-stream (M2 Max, Q4_K_M 8B) | ~55 tok/s |
| Single-stream (RTX 4090, Q4_K_M 8B + FlashAttn) | ~140 tok/s |
| Concurrent slots (--parallel) | supported but not as efficient as vLLM |
| RPi 5 (Q4 4B) | ~3 tok/s |
| Cold start | 1-5 s (mmap; faster when the model is already in the page cache) |
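Numbers like these depend heavily on hardware, quantization, and context length, so measure locally. llama-bench ships with the build; a minimal sketch (512-token prompt pass, 128-token generation pass).

./build/bin/llama-bench \
-m ./models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-p 512 -n 128 -ngl 99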
Common pitfalls
- Forgetting -ngl: If you have a GPU, add -ngl 99 (all layers on the GPU). Without it the model runs on the CPU and you'll wonder why it's slow.
- Build flags vary: GGML_METAL, GGML_CUDA, and GGML_VULKAN must be enabled at build time; the wrong flag means no acceleration. When downloading prebuilt binaries, pick the right variant.
- Confusing quant naming: Q4_0, Q4_1, Q4_K_S, Q4_K_M, and IQ4_XS are all 4-bit but use different algorithms. Q4_K_M is the safe default; IQ4_XS is newer and higher quality but slower on some CPUs.
- Versioning your own binary: llama.cpp moves fast and the GGUF format occasionally changes. For production, pin the binary, the GGUF files, and the commit hash together (one way to do this is sketched below).
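One possible pinning workflow, sketched below; the release tag and output file names are placeholders, not a project convention.

# Build from a pinned tag and record exactly what shipped (placeholder tag/paths)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git checkout b1234 # placeholder release tag
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
git rev-parse HEAD > ../BUILD_COMMIT.txt
shasum -a 256 ../models/*.gguf > ../MODEL_HASHES.txt # sha256sum on Linux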