LM Studio
Download, run, and chat with models in a GUI
Overview
LM Studio is a desktop app that brings local AI to you entirely through a GUI. Browse Hugging Face inside the app, download a model, and run it with one click. No terminal, no Python, no config files.
Under the hood it runs the llama.cpp and MLX engines. It provides both a chat UI and an OpenAI-compatible local server, so it can also be called from code. Free, but not open source.
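As a quick sanity check of that local server, the short sketch below lists whatever models the running LM Studio instance exposes. It assumes the server has been started on the default port 1234 (via the in-app server toggle or `lms server start`); the api_key value is a placeholder, since the local server does not verify it.
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Print every model the running LM Studio instance exposes.
for model in client.models.list().data:
    print(model.id)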
Installation
# 1. Download the .dmg from https://lmstudio.ai/download
# 2. Drag it into the Applications folder
# 3. Open it
# Optionally install the CLI (macOS)
~/.lmstudio/bin/lms bootstrap
Configuration
Most settings live in the GUI: search → download → 'Load' → chat. The power is in the per-model config panel: Context Length, GPU Offload, Temperature, Top-P, Repeat Penalty — all as sliders.
# List installed models
lms ls
# Download a model
lms get llama-3.1-8b-instruct
# Start the server (default port 1234)
lms server start
# Swap the loaded model
lms load llama-3.1-8b-instruct
lms unload --all
# Status
lms status

Context Length # 2048 → 8192 → 32K (if the model supports it)
GPU Offload (layers) # Auto → 100% (automatic on Apple Silicon)
CPU Threads # half the CPU core count is usually optimal
Eval Batch Size # prompt-processing batch size (default 512)
Temperature, Top-P # sampling
Repeat Penalty # default 1.1, curbs repetition
mlock # pin the model in RAM (useful for large models)
Flash Attention # speedup on CUDA
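The sliders set per-model defaults in the GUI; when calling the local server, the standard OpenAI request fields (temperature, top_p, max_tokens) can still be passed per request and should take precedence for that call. A minimal sketch, assuming the default port and that llama-3.1-8b-instruct is the loaded model:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Per-request sampling overrides via standard OpenAI parameters.
resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed name; use whatever `lms ls` shows
    messages=[{"role": "user", "content": "Summarize GGUF quantization in two sentences."}],
    temperature=0.2,  # lower = more deterministic
    top_p=0.9,
    max_tokens=128,
)
print(resp.choices[0].message.content)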
Hardware acceleration
Two engines: llama.cpp (GGUF) everywhere; MLX (Apple Silicon) only on M-series Macs. LM Studio detects the hardware and picks the engine for you.
| Hardware | Support |
| --- | --- |
| Apple Silicon (M1/M2/M3/M4) | Metal (llama.cpp) + MLX engine option |
| NVIDIA GPU | CUDA — Flash Attention, full offload |
| AMD GPU | Vulkan (Win/Linux), no ROCm yet |
| Intel Mac | CPU only |
| CPU only | ✓ |
| Multi-GPU | Limited, llama.cpp split support |
Model formats & quantization
GGUF (llama.cpp) and MLX formats. Browse Hugging Face from inside the app, click download. Quantization variants come with friendly labels ('Best for your hardware' badge).
| Format / feature | Support |
| --- | --- |
| GGUF | All platforms (llama.cpp engine) |
| MLX | Apple Silicon only, MLX engine |
| Quant picker | GUI shows 'Recommended' badge for the right choice |
| Vision models | LLaVA, Qwen-VL — multimodal chat |
| Embedding models | Yes (Nomic, BGE), /v1/embeddings endpoint |
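For the vision models mentioned above, multimodal chat also works through the local server. A hedged sketch, assuming a vision-capable model (e.g. a LLaVA GGUF) is loaded and that the server accepts OpenAI-style image_url content parts with a base64 data URL; the file path and model name are illustrative:
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Encode a local image as a data URL (path is illustrative).
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="llava-v1.5-7b",  # assumed name; use the vision model you actually loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)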
API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hi!"}],
)
print(resp.choices[0].message.content)

emb = client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input="A sentence to embed",
)
print(len(emb.data[0].embedding))  # 768
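The server also supports token streaming through the standard OpenAI client. A minimal sketch, assuming the same server and loaded model as above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()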
Performance
| Metric | Typical figure |
| --- | --- |
| Single-stream (M2 Max, MLX 8B Q4) | ~60 tok/s |
| Single-stream (M2 Max, GGUF 8B Q4) | ~50 tok/s |
| Concurrent requests | Very low — single-user focused |
| Cold load | 10-60 s |
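These figures vary with hardware and quantization; the sketch below measures your own single-stream throughput, assuming the response includes the usual OpenAI usage.completion_tokens field (which the local endpoint normally reports):
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # whichever model is loaded
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    max_tokens=256,
)
elapsed = time.time() - start

# Rough tokens/second; includes prompt processing, so pure decode speed is a bit higher.
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")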
Common pitfalls
- Not open source: The app is closed-source. For embedding in a product, shipping your own binary, or in-house distribution, prefer Ollama or llama.cpp. Free for personal use.
- Single-user product: Server mode accepts concurrent requests but isn't built for high throughput. For production serving, use vLLM.
- Disk fills up fast: It's easy to download ten models from the GUI, and each one is 4-40 GB. Move the models folder (Settings → Models folder) to an external drive and delete unused models; a sketch for checking what's on disk follows this list.
- Engine confusion on Apple: The same model can ship as both GGUF and MLX. MLX is usually 20-30% faster, but not every model exists in MLX. Watch the badges.
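A small sketch for the disk pitfall above: it sums model sizes on disk. The path is an assumption (recent LM Studio versions default to ~/.lmstudio/models; older builds used ~/.cache/lm-studio/models, and the GUI lets you move it), so adjust it to whatever Settings → Models folder shows.
from pathlib import Path

# Assumed default models directory; change if you've relocated it in Settings.
models_dir = Path.home() / ".lmstudio" / "models"

total = 0
for model_dir in sorted(models_dir.glob("*/*")):  # publisher/model layout
    size = sum(f.stat().st_size for f in model_dir.rglob("*") if f.is_file())
    total += size
    print(f"{size / 1e9:6.1f} GB  {model_dir.relative_to(models_dir)}")
print(f"{total / 1e9:6.1f} GB  total")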