
llama.cpp

C++ inference engine that runs anywhere

Overview

llama.cpp is a pure C/C++ inference engine written by Georgi Gerganov. Zero Python dependency, single static binary, runs anywhere. Tools like Ollama and LM Studio are built on top of it — but you can use it directly for more control.

llama.cpp is also the home of the GGUF model format. CPU, Metal, CUDA, Vulkan, ROCm, OpenCL, SYCL — it supports nearly any hardware. Ideal for embedded use, server deployment, or shipping inside your own product.

Installation

# Homebrew (Metal enabled automatically)
brew install llama.cpp

# Or build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

./build/bin/llama-cli --version

Configuration

Run a model (CLI)

# Download a GGUF from Hugging Face
huggingface-cli download bartowski/Llama-3.1-8B-Instruct-GGUF \
  Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models

# Chat
./build/bin/llama-cli \
  -m ./models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p "Hello, introduce yourself." \
  -n 256 \
  -c 8192 \
  -ngl 99    # GPU layers (for Metal/CUDA)

Local server (OpenAI-compatible)

./build/bin/llama-server \
  -m ./models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  -c 8192 \
  -ngl 99 \
  --parallel 4 \
  --cont-batching

# OpenAI-compatible endpoint:
# http://localhost:8080/v1/chat/completions
# Web UI:   http://localhost:8080/
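
The server speaks the OpenAI chat API directly, so a quick smoke test needs nothing but curl. A minimal sketch, assuming the server above is running on port 8080 (llama-server serves whatever model it loaded, so the model name is arbitrary):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'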

Key parameters

-m, --model        # path to the GGUF file
-c, --ctx-size     # context length (32K-128K if the model supports it)
-n, --n-predict    # max tokens in the response (-1 = unlimited)
-ngl N             # layers to offload to the GPU (-ngl 99 = all of them)
-t, --threads      # CPU thread count
-tb, --threads-batch  # threads for batch processing
--temp 0.3         # temperature
--top-k 40
--top-p 0.9
--repeat-penalty 1.1
--mlock            # pin the model in RAM (prevents swapping)
--no-mmap          # disable memory-mapped loading (mmap is the default)
--cont-batching    # continuous batching (on the server)
--parallel N       # concurrent slots
--flash-attn       # FlashAttention (faster, CUDA)

Hardware acceleration

llama.cpp has the broadest hardware support: Metal (Apple), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-vendor), SYCL (Intel), OpenCL, and even ARM NEON SIMD on a Raspberry Pi.

Apple Silicon (M1/M2/M3/M4): Metal (default, unified memory)
NVIDIA GPU: CUDA + cuBLAS, FlashAttention
AMD GPU: ROCm (Linux) or Vulkan (cross-OS)
Intel GPU: SYCL (oneAPI)
Vulkan: vendor-agnostic (Linux/Win)
CPU only: AVX2/NEON
Embedded: Raspberry Pi, ARM
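
Each backend is a compile-time choice (see Common pitfalls below). A build sketch for the two flags named there besides Metal; both are standard CMake options in the ggml-org tree:

# NVIDIA (CUDA + cuBLAS)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Vendor-agnostic Vulkan (AMD, Intel, NVIDIA)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j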

Model formats & quantization

Models ship in GGUF, llama.cpp's own format. Users like bartowski and TheBloke publish GGUF variants of every popular model on Hugging Face, each in many quantization levels.

F16 / BF16: lossless, largest
Q8_0: near-FP16 quality
Q6_K: very good, mid size
Q5_K_M / Q5_K_S: good quality, reasonable VRAM
Q4_K_M ★: recommended, the quality/size sweet spot
Q3_K_M: small, acceptable quality
Q2_K / IQ2_XS: tiny, experimental only
IQ4_XS / IQ3_S: next-gen "i-quants", better quality at lower bit-widths

Convert to GGUF yourself

# HF safetensors → GGUF (FP16)
python3 convert_hf_to_gguf.py /path/to/model \
  --outfile model-f16.gguf

# Quantize (Q4_K_M)
./build/bin/llama-quantize \
  model-f16.gguf model-Q4_K_M.gguf Q4_K_M
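
To see what a quant actually costs in quality, the llama-perplexity tool (built alongside the other binaries) reports perplexity on a text file; compare the quantized number against the F16 baseline, lower is better. The wiki.test.raw filename is just the conventional WikiText-2 test set; any representative text works:

./build/bin/llama-perplexity -m model-f16.gguf    -f wiki.test.raw -ngl 99
./build/bin/llama-perplexity -m model-Q4_K_M.gguf -f wiki.test.raw -ngl 99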

API

OpenAI-compatible (Python)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hi"}],
)
print(resp.choices[0].message.content)

Native /completion (curl)

curl http://localhost:8080/completion -d '{
  "prompt": "Selam",
  "n_predict": 128,
  "temperature": 0.3,
  "top_k": 40,
  "top_p": 0.9
}'
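
The native endpoint can also stream tokens back as server-sent events: set "stream": true and pass curl -N to disable output buffering. A sketch:

curl -N http://localhost:8080/completion -d '{
  "prompt": "Hello",
  "n_predict": 128,
  "stream": true
}'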

Performance

Single-stream (M2 Max, Q4_K_M 8B): ~55 tok/s
Single-stream (RTX 4090, Q4_K_M 8B + FlashAttn): ~140 tok/s
Concurrent slots (--parallel): supported, but not as efficient as vLLM
RPi 5 (Q4 4B): ~3 tok/s
Cold start: 1-5 s (mmap + RAM)
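
Throughput depends heavily on quant, context size, and hardware, so measure on your own machine. llama-bench, built with the other binaries, times prompt processing (-p) and generation (-n) separately:

./build/bin/llama-bench \
  -m ./models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p 512 -n 128 -ngl 99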

Common pitfalls

  • Forgetting -ngl: if you have a GPU, add -ngl 99 (all layers on the GPU). Without it the model runs on the CPU and you'll wonder why it's slow.
  • Build flags vary: GGML_METAL, GGML_CUDA, GGML_VULKAN must be enabled at build time. The wrong flag means no acceleration. When downloading prebuilt binaries, pick the right variant.
  • Confusing quant naming: Q4_0, Q4_1, Q4_K_S, Q4_K_M, and IQ4_XS are all 4-bit but use different algorithms. Q4_K_M is the safe default; IQ4_XS is newer and better quality but slower on some CPUs.
  • Version your own binary: llama.cpp moves fast and the GGUF format occasionally changes. For production, pin the binary, the GGUF file, and the commit hash together (a sketch follows below).
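
A minimal pinning sketch for that last point. The commit placeholder and the DEPLOY.lock filename are hypothetical; substitute whatever version you actually validated:

# build from an exact, recorded commit
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout <commit-or-tag>        # the exact version you tested
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

# record binary + model together (hypothetical lockfile)
echo "llama.cpp=$(git rev-parse HEAD)" >> DEPLOY.lock
shasum -a 256 ../models/*.gguf        >> DEPLOY.lock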
