LM Studio
Download, run, and chat with models in a GUI
Overview
LM Studio is a desktop app that brings local AI to you entirely through a GUI. Browse Hugging Face inside the app, download a model, and run it with one click. No terminal, no Python, no config files.
Under the hood it runs the llama.cpp and MLX engines. It provides both a chat UI and an OpenAI-compatible local server, so it can also be called from code. Free, but not open source.
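As a quick sanity check of that local server, the short sketch below lists whatever models the running LM Studio instance exposes. It assumes the server has been started on the default port 1234 (via the in-app server toggle or `lms server start`); the api_key value is a placeholder, since the local server does not verify it.
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Print every model the running LM Studio instance exposes.
for model in client.models.list().data:
    print(model.id)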
Installation
# 1. Download the .dmg from https://lmstudio.ai/download
# 2. Drag it into the Applications folder
# 3. Open it
# Optionally install the CLI (macOS)
~/.lmstudio/bin/lms bootstrap
Configuration
Most settings live in the GUI: search → download → 'Load' → chat. The power is in the per-model config panel: Context Length, GPU Offload, Temperature, Top-P, Repeat Penalty — all as sliders.
# List installed models
lms ls
# Download a model
lms get llama-3.1-8b-instruct
# Start the server (default port 1234)
lms server start
# Swap the loaded model
lms load llama-3.1-8b-instruct
lms unload --all
# Status
lms status

Context Length # 2048 → 8192 → 32K (if the model supports it)
GPU Offload (layers) # Auto → 100% (automatic on Apple Silicon)
CPU Threads # half the CPU core count is usually optimal
Eval Batch Size # prompt-processing batch size (default 512)
Temperature, Top-P # sampling
Repeat Penalty # default 1.1, curbs repetition
mlock # pin the model in RAM (useful for large models)
Flash Attention # speedup on CUDA
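The sliders set per-model defaults in the GUI; when calling the local server, the standard OpenAI request fields (temperature, top_p, max_tokens) can still be passed per request and should take precedence for that call. A minimal sketch, assuming the default port and that llama-3.1-8b-instruct is the loaded model:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Per-request sampling overrides via standard OpenAI parameters.
resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed name; use whatever `lms ls` shows
    messages=[{"role": "user", "content": "Summarize GGUF quantization in two sentences."}],
    temperature=0.2,  # lower = more deterministic
    top_p=0.9,
    max_tokens=128,
)
print(resp.choices[0].message.content)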
Hardware acceleration
Two engines: llama.cpp (GGUF) everywhere; MLX (Apple Silicon) only on M-series Macs. LM Studio detects the hardware and picks the engine for you.
| Hardware | Support |
| --- | --- |
| Apple Silicon (M1/M2/M3/M4) | Metal (llama.cpp) + MLX engine option |
| NVIDIA GPU | CUDA — Flash Attention, full offload |
| AMD GPU | Vulkan (Win/Linux), no ROCm yet |
| Intel Mac | CPU only |
| CPU only | ✓ |
| Multi-GPU | Limited, llama.cpp split support |
Model formats & quantization
GGUF (llama.cpp) and MLX formats. Browse Hugging Face from inside the app, click download. Quantization variants come with friendly labels ('Best for your hardware' badge).
| Format / feature | Support |
| --- | --- |
| GGUF | All platforms (llama.cpp engine) |
| MLX | Apple Silicon only, MLX engine |
| Quant picker | GUI shows 'Recommended' badge for the right choice |
| Vision models | LLaVA, Qwen-VL — multimodal chat |
| Embedding models | Yes (Nomic, BGE), /v1/embeddings endpoint |
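For the vision models mentioned above, multimodal chat also works through the local server. A hedged sketch, assuming a vision-capable model (e.g. a LLaVA GGUF) is loaded and that the server accepts OpenAI-style image_url content parts with a base64 data URL; the file path and model name are illustrative:
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Encode a local image as a data URL (path is illustrative).
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="llava-v1.5-7b",  # assumed name; use the vision model you actually loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)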
API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hi!"}],
)
print(resp.choices[0].message.content)

emb = client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input="A sentence to embed",
)
print(len(emb.data[0].embedding))  # 768
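The server also supports token streaming through the standard OpenAI client. A minimal sketch, assuming the same server and loaded model as above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()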
Performance
| Metric | Typical figure |
| --- | --- |
| Single-stream (M2 Max, MLX 8B Q4) | ~60 tok/s |
| Single-stream (M2 Max, GGUF 8B Q4) | ~50 tok/s |
| Concurrent requests | Very low — single-user focused |
| Cold load | 10-60 s |
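These figures vary with hardware and quantization; the sketch below measures your own single-stream throughput, assuming the response includes the usual OpenAI usage.completion_tokens field (which the local endpoint normally reports):
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # whichever model is loaded
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    max_tokens=256,
)
elapsed = time.time() - start

# Rough tokens/second; includes prompt processing, so pure decode speed is a bit higher.
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")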
Common pitfalls
- Not open source: The app is closed-source. For embedding in a product, shipping your own binary, or in-house distribution, prefer Ollama or llama.cpp. Free for personal use.
- Single-user product: Server mode accepts concurrent requests but isn't built for high throughput. For production serving, use vLLM.
- Disk fills up fast: It's easy to download ten models from the GUI, and each one is 4-40 GB. Move the models folder (Settings → Models folder) to an external drive and delete unused models; a sketch for checking what's on disk follows this list.
- Engine confusion on Apple: The same model can ship as both GGUF and MLX. MLX is usually 20-30% faster, but not every model exists in MLX. Watch the badges.
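A small sketch for the disk pitfall above: it sums model sizes on disk. The path is an assumption (recent LM Studio versions default to ~/.lmstudio/models; older builds used ~/.cache/lm-studio/models, and the GUI lets you move it), so adjust it to whatever Settings → Models folder shows.
from pathlib import Path

# Assumed default models directory; change if you've relocated it in Settings.
models_dir = Path.home() / ".lmstudio" / "models"

total = 0
for model_dir in sorted(models_dir.glob("*/*")):  # publisher/model layout
    size = sum(f.stat().st_size for f in model_dir.rglob("*") if f.is_file())
    total += size
    print(f"{size / 1e9:6.1f} GB  {model_dir.relative_to(models_dir)}")
print(f"{total / 1e9:6.1f} GB  total")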