
vLLM

High-throughput production server

Overview

vLLM is a production-grade LLM inference server out of UC Berkeley. It is famous for two innovations: PagedAttention (paged KV-cache management that nearly eliminates memory fragmentation) and continuous batching (requests join and leave the running batch at every decoding step instead of waiting for a fixed batch to finish, boosting throughput 5-10×).

It loads HF Transformers models directly and exposes an OpenAI-compatible API. There is no Apple Silicon support: NVIDIA (CUDA) or AMD (ROCm) GPUs only. Multi-GPU tensor parallelism works out of the box.

Installation

# Python 3.9-3.12, CUDA 12.1+
python -m venv .venv && source .venv/bin/activate

# with pip
pip install vllm

# with uv (faster, recommended)
uv pip install vllm

# verify
vllm --version
nvidia-smi  # the GPU should be visible

No native macOS support. To try on a Mac, use a Linux VM (UTM, OrbStack) or a remote GPU (RunPod, Lambda).
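
Before moving on, it is worth confirming that the GPU is actually visible to PyTorch, which vLLM builds on. A minimal sketch, assuming torch is installed in the same environment:

import torch

# vLLM needs a visible CUDA (or ROCm) device; this mirrors what nvidia-smi reports.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
else:
    print("No GPU visible - vLLM will not start on this machine.")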

Configuration

Start the server
# Single GPU
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --dtype auto

# Multi-GPU (tensor parallel: 70B on 2× A100)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

# Quantized (with AWQ, a 70B model fits on a single GPU)
vllm serve casperhansen/llama-3-70b-instruct-awq \
  --quantization awq \
  --max-model-len 8192
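
Once the server is up, a quick sanity check is to list the loaded models through the OpenAI-compatible endpoint. A minimal sketch using only the standard library:

import json
import urllib.request

# /v1/models lists the model(s) this server instance is serving.
with urllib.request.urlopen("http://localhost:8000/v1/models") as r:
    print(json.dumps(json.load(r), indent=2))
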
Key parameters
--max-model-len 8192          # context window (KV cache is reserved for it)
--max-num-seqs 256            # concurrent sequences (batch width)
--max-num-batched-tokens 8192 # max tokens per batch
--gpu-memory-utilization 0.90 # fraction of VRAM to use (0.9 = 90%)
--swap-space 4                # CPU swap space (GB) for KV cache overflow
--enforce-eager               # disables CUDA graphs (for debugging)
--enable-prefix-caching       # cache repeated prompt prefixes
--enable-chunked-prefill      # split long prompts into chunks (lowers latency)
--quantization awq|gptq|fp8   # load a quantized model
--dtype bfloat16|float16|auto # weight dtype
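
The same knobs exist as keyword arguments on the offline vllm.LLM class, which wraps the engine behind the CLI. A minimal sketch; the values are illustrative, not recommendations:

from vllm import LLM

# Each kwarg mirrors the corresponding `vllm serve` flag.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=8192,           # --max-model-len
    max_num_seqs=256,             # --max-num-seqs
    gpu_memory_utilization=0.90,  # --gpu-memory-utilization
    enable_prefix_caching=True,   # --enable-prefix-caching
    dtype="auto",                 # --dtype
)
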
Speculative decoding (2-3× speedup)
# Small "draft" model proposes tokens, the large model verifies them
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 2

Hardware acceleration

vLLM is GPU-only in practice; there is no usable CPU fallback. On NVIDIA, CUDA 11.8 is the floor and 12.1+ is recommended; it runs best on H100/A100, but RTX 4090/3090 work. On AMD, it needs ROCm 6.0+ with datacenter GPUs like the MI300X; RDNA consumer GPUs are unstable.

NVIDIA: CUDA 11.8+ (12.1+ recommended), compute capability ≥ 7.0 (Volta or newer)
AMD: ROCm 6.0+ (MI300X, MI250)
Apple Silicon (M1/M2/M3/M4): not supported
CPU-only: not supported in practice
Multi-GPU: tensor parallelism (TP) + pipeline parallelism (PP)
Multi-node: via Ray (production clusters)

Model formats & quantization

Format: Hugging Face Transformers (safetensors). Models are pulled by HF Hub slug (--model org/model) or loaded from a local path. Quantization uses its own set of formats; GGUF is not among them.

FP16 / BF16: default, highest quality, most VRAM
FP8: requires H100-class hardware, ~50% VRAM savings
AWQ (Activation-aware Weight Quantization): INT4, preserves accuracy, ~4× smaller
GPTQ: INT4, slightly faster than AWQ, similar quality
BitsAndBytes (NF4): INT4, widespread on the HF Hub, experimental support in vLLM
GGUF: not supported

Practical: AWQ or FP8 are the production picks. Llama-3.1-70B-AWQ fits on one A100 80GB; the FP16 version needs 2× A100.
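
The back-of-envelope arithmetic behind that claim: weight memory is roughly parameter count × bytes per parameter, before KV cache and runtime overhead. A quick sketch:

# Rough weight-memory estimate; ignores KV cache and activation overhead.
params = 70e9
for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("AWQ INT4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
# FP16 ~140 GB (needs 2× A100 80GB), FP8 ~70 GB, INT4 ~35 GB (fits on one A100)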

API

OpenAI-compatible (Python)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM auth varsayılan kapalı
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Selam!"}],
    max_tokens=256,
    temperature=0.3,
)
print(resp.choices[0].message.content)
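
The same endpoint also supports streaming through the standard OpenAI client; tokens arrive incrementally instead of in one final response. A minimal sketch reusing the client above:

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    stream=True,  # server-sent events, chunk by chunk
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
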
Batch offline inference (Python)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_model_len=4096,
)

prompts = ["Sorgu 1", "Sorgu 2", "Sorgu 3"]
params = SamplingParams(temperature=0.3, max_tokens=200)

# Batching many prompts (e.g. 1,000) in one call is 5-10× faster than one at a time
outputs = llm.generate(prompts, params)
for o in outputs:
    print(o.outputs[0].text)

Performance

Single-request throughput (1× A100, Llama-3.1-8B FP16): ~95 tok/s
Concurrent throughput (32 streams, same setup): ~2,400 tok/s total
Throughput vs Ollama (32 streams): 5-10×
VRAM efficiency (PagedAttention): ~25% less fragmentation than a contiguous KV cache
Cold start: 30-90 s (model load + CUDA graph capture)
Prefix caching: cached system prompts save up to 70% of prompt tokens
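
Prefix caching is what drives that last row: when many requests share the same long system prompt, vLLM reuses the cached KV blocks of the shared prefix instead of recomputing them on every request. A minimal offline sketch; the model name and prompts are illustrative:

from vllm import LLM, SamplingParams

# enable_prefix_caching lets identical prompt prefixes share KV cache blocks.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

system = "You are a support assistant for ExampleCorp. Always answer briefly. "
questions = ["How do I reset my password?", "Where can I find my invoice?"]

params = SamplingParams(temperature=0.3, max_tokens=100)
# The later requests' prefill reuses the KV blocks already computed for `system`.
outputs = llm.generate([system + q for q in questions], params)
for o in outputs:
    print(o.outputs[0].text)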

Common pitfalls

  • max-model-len = VRAM bloat: a high max-model-len pre-allocates KV cache memory. Don't set 128K and wonder where the VRAM went. Match your real need and turn on prefix caching.
  • GGUF is not supported: vLLM loads HF safetensors. The GGUF file you used in Ollama/llama.cpp won't load here. Look for AWQ/GPTQ variants on the HF Hub.
  • No Apple Silicon: to try vLLM on a Mac, use a Linux VM or a remote GPU (RunPod, Lambda, Modal). Production is Linux + NVIDIA anyway.
  • Auth is off by default: vllm serve opens port 8000 to everyone. Use --api-key, or front it with a reverse proxy plus auth (see the sketch after this list). Don't expose it to the internet directly.
  • OOM during CUDA graph capture: if you hit OOM while CUDA graphs are being captured, --enforce-eager turns them off (slower, but frees memory). Use it for debugging, not production.
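
A minimal sketch of the --api-key route mentioned above; the key value is illustrative:

from openai import OpenAI

# Server side: vllm serve meta-llama/Llama-3.1-8B-Instruct --api-key "change-me"
# The OpenAI client sends the key as a Bearer token on every request.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="change-me",  # must match the server's --api-key
)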

Resources