vLLM
High-throughput production server
Overview
vLLM is a production-grade LLM inference server out of UC Berkeley. It is best known for two innovations: PagedAttention (paged KV cache management that nearly eliminates memory fragmentation) and continuous batching (requests join and leave the running batch at every decoding step instead of waiting for a fixed batch to finish, boosting throughput 5-10× under concurrent load).
Loads HF Transformers models directly and exposes an OpenAI-compatible API. No Apple Silicon — NVIDIA (CUDA) or AMD (ROCm) GPUs only. Multi-GPU + tensor parallelism out of the box.
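To make "OpenAI-compatible" concrete, a chat request against a running server looks like any other OpenAI API call. This sketch assumes a server already started on port 8000 with the 8B model from the Configuration section:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'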
Installation
# Python 3.9-3.12, CUDA 12.1+
python -m venv .venv && source .venv/bin/activate
# With pip (simplest)
pip install vllm
# With uv (faster, recommended)
uv pip install vllm
# Verify the install
vllm --version
nvidia-smi      # the GPU should be visible

No native macOS support. To try vLLM on a Mac, use a Linux VM (UTM, OrbStack) or a remote GPU (RunPod, Lambda).
Configuration
# Single GPU
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--dtype auto
# Multi-GPU (tensor parallel: 70B on 2× A100)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
# Quantized (with AWQ, a 70B model fits on a single GPU)
vllm serve casperhansen/llama-3-70b-instruct-awq \
--quantization awq \
--max-model-len 8192

# Key flags
--max-model-len 8192            # context window (KV cache is reserved for it)
--max-num-seqs 256              # max concurrent sequences (batch width)
--max-num-batched-tokens 8192   # max tokens per batch
--gpu-memory-utilization 0.90   # fraction of VRAM to use (0.9 = 90%)
--swap-space 4                  # CPU swap space (GB) for the KV cache
--enforce-eager                 # disable CUDA graphs (for debugging)
--enable-prefix-caching         # cache repeated prompt prefixes
--enable-chunked-prefill        # split long prompts into chunks (lowers latency)
--quantization awq|gptq|fp8     # load a quantized model
--dtype bfloat16|float16|auto   # weight dtype

# Speculative decoding: small "draft" model + large verifier
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 2

Hardware acceleration
vLLM is GPU-only in practice — no usable CPU fallback. On NVIDIA: CUDA 11.8+ recommended; best on H100/A100, but RTX 4090/3090 work. On AMD: ROCm 6.0+ with datacenter GPUs like MI300X — RDNA consumer GPUs are unstable.
| Platform | Support |
|---|---|
| NVIDIA | CUDA 11.8+, compute capability ≥ 7.0 (Volta+) |
| AMD | ROCm 6.0+ (MI300X, MI250) |
| Apple Silicon (M1/M2/M3/M4) | — |
| CPU only | — |
| Multi-GPU | Tensor parallelism (TP) + pipeline parallelism (PP); see the example below |
| Multi-node | Via Ray (production cluster) |
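As a sketch of combining TP and PP, the command below assumes a single node with 8 GPUs (4-way tensor parallel inside each of 2 pipeline stages); the sizes are illustrative, not a recommendation, and multi-node PP additionally needs a Ray cluster:

# 8 GPUs on one node: 4-way tensor parallel × 2 pipeline stages
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --max-model-len 16384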
Model formats & quantization
Format: Hugging Face Transformers (safetensors). Models are pulled by HF Hub slug (--model org/model) or loaded from a local path (see the sketch below). vLLM uses its own set of quantization formats and does not read GGUF.
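A minimal sketch of the local-path route, assuming a safetensors checkpoint has already been downloaded to /models/llama-3.1-8b-instruct (the path and served name here are placeholders):

# Serve from a local directory instead of the HF Hub
vllm serve /models/llama-3.1-8b-instruct \
  --served-model-name llama-3.1-8b \
  --max-model-len 8192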
| Format | Notes |
|---|---|
| FP16 / BF16 | Default, highest quality, most VRAM |
| FP8 | Requires H100+ hardware, ~50% VRAM savings |
| AWQ (Activation-aware Weight Quant) | INT4, preserves accuracy, ~4× smaller |
| GPTQ | INT4, slightly faster than AWQ, similar quality |
| BitsAndBytes (NF4) | INT4, common in the HF ecosystem, experimental support in vLLM |
| GGUF | — |
Practical: AWQ or FP8 are the production picks. Llama-3.1-70B-AWQ fits on one A100 80GB; the FP16 version needs 2× A100.
API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM auth is off by default
)
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    temperature=0.3,
)
print(resp.choices[0].message.content)

# Offline batch inference (vLLM Python API)
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_model_len=4096,
)

prompts = ["Query 1", "Query 2", "Query 3"]
params = SamplingParams(temperature=0.3, max_tokens=200)

# Batch many prompts (e.g. 1000) in a single call → 5-10× faster than one at a time
outputs = llm.generate(prompts, params)
for o in outputs:
    print(o.outputs[0].text)

Performance
| Metric | Value |
|---|---|
| Single-request throughput (1× A100, Llama-3.1-8B FP16) | ~95 tok/s |
| Concurrent throughput (32 streams, same setup) | ~2400 tok/s aggregate |
| Throughput vs Ollama (32 streams) | 5-10× |
| VRAM efficiency (PagedAttention) | ~25% less fragmentation vs classic allocation |
| Cold start | 30-90 s (model load + CUDA graph capture) |
| Prefix caching | Cached system prompts → up to 70% prompt-token savings |
Common pitfalls
- max-model-len = VRAM bloat: A high --max-model-len pre-allocates KV cache memory. Don't set 128K and then wonder where your VRAM went. Match it to your real need and turn on prefix caching.
- GGUF is not supported: vLLM loads HF safetensors. The GGUF file you used in Ollama/llama.cpp won't load here. Look for AWQ/GPTQ variants on the HF Hub.
- No Apple Silicon: To try it on a Mac, use a Linux VM or a remote GPU (RunPod, Lambda, Modal). Production is Linux + NVIDIA anyway.
- Auth is off by default: vllm serve opens port 8000 to everyone. Use --api-key, or front it with a reverse proxy plus auth; don't expose it to the internet directly (see the sketch after this list).
- OOM chain (--enforce-eager): If you OOM during CUDA graph capture, --enforce-eager turns graphs off (slower, but eases memory pressure). Use it for debugging, not production.
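A minimal sketch of locking the server down with a key (the token and model are placeholders; keep a reverse proxy or network-level controls in front regardless):

# Start the server with an API key
vllm serve meta-llama/Llama-3.1-8B-Instruct --api-key my-secret-token

# Clients now have to send the key as a Bearer token
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}'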