
LM Studio

Download, run, and chat with models in a GUI

Overview

LM Studio is a desktop app that brings local AI to you entirely through a GUI. Browse Hugging Face inside the app, download a model, and run it with one click. No terminal, no Python, no config files.

Under the hood it runs the llama.cpp and MLX engines. It provides both a chat UI and an OpenAI-compatible local server, so it's callable from code too. Free, but not open source.

Installation

# 1. Download the .dmg from https://lmstudio.ai/download
# 2. Drag it into the Applications folder
# 3. Open it

# Optionally install the CLI (macOS)
~/.lmstudio/bin/lms bootstrap
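
After installing, a quick way to confirm the local server is reachable from code (start it from the app's Developer tab or with lms server start, shown below); a minimal sketch, assuming the default port 1234:

from openai import OpenAI

# Any non-empty API key works; the local server does not validate it
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# List the models the server currently exposes
for m in client.models.list():
    print(m.id)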

Configuration

Most settings live in the GUI: search → download → 'Load' → chat. The power is in the per-model config panel: Context Length, GPU Offload, Temperature, Top-P, Repeat Penalty — all as sliders.

Load a model from the CLI (lms)
# List downloaded models
lms ls

# Download a model
lms get llama-3.1-8b-instruct

# Start the server (default port 1234)
lms server start

# Switch which model is loaded
lms load llama-3.1-8b-instruct
lms unload --all

# Status
lms status
Key GUI parameters
Context Length        # 2048 → 8192 → 32K (if the model supports it)
GPU Offload (layers)  # Auto → 100% (automatic on Apple Silicon)
CPU Threads           # Half the number of CPU cores is usually optimal
Eval Batch Size       # Prompt-processing batch size (default 512)
Temperature, Top-P    # Sampling
Repeat Penalty        # Default 1.1, curbs repetition
mlock                 # Pin the model in RAM (for large models)
Flash Attention       # Speedup on CUDA
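
The same sampling knobs can also be set per request when calling the local server, not only through the GUI sliders; a minimal sketch, assuming the server is running with a model loaded (frequency_penalty is used as the closest OpenAI-style analogue of repeat penalty):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Suggest three names for a coffee shop."}],
    temperature=0.7,        # sampling temperature, same knob as the GUI slider
    top_p=0.9,              # nucleus sampling
    frequency_penalty=0.2,  # OpenAI-style stand-in for repeat penalty
    max_tokens=128,
)
print(resp.choices[0].message.content)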

Hardware acceleration

Two engines: llama.cpp (GGUF) runs everywhere; MLX runs only on Apple Silicon (M-series) Macs. LM Studio detects the hardware and picks the engine for you.

Apple Silicon (M1/M2/M3/M4)   Metal (llama.cpp) + MLX engine option
NVIDIA GPU                    CUDA: Flash Attention, full offload
AMD GPU                       Vulkan (Win/Linux), no ROCm yet
Intel Mac                     CPU only
Multi-GPU                     Limited, llama.cpp split support

Model formats & quantization

GGUF (llama.cpp) and MLX formats. Browse Hugging Face from inside the app, click download. Quantization variants come with friendly labels ('Best for your hardware' badge).

GGUF               All platforms (llama.cpp engine)
MLX                Apple Silicon only, MLX engine
Quant picker       GUI shows a 'Recommended' badge for the right choice
Vision models      LLaVA, Qwen-VL: multimodal chat (see the sketch below)
Embedding models   Yes (Nomic, BGE), /v1/embeddings endpoint
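
For vision models, image input goes through the same chat endpoint using the OpenAI-style image_url content part; a sketch, assuming a vision model is loaded (the model name here is illustrative) and a local photo.png exists:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Encode a local image as a base64 data URL
with open("photo.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="llava-v1.5-7b",  # illustrative; use whichever vision model you loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this picture?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)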

API

OpenAI-compatible (Python)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hi!"}],
)
print(resp.choices[0].message.content)
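
Streaming works through the same endpoint with the standard OpenAI stream flag; a minimal sketch, reusing the client above:

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()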
Embedding (Python)
emb = client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input="Bir cümle embed et",
)
print(len(emb.data[0].embedding))  # 768
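
A common next step is comparing embeddings; a minimal sketch of cosine similarity using only the standard library, reusing the client above:

import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

a = client.embeddings.create(model="nomic-embed-text-v1.5", input="local LLM runner").data[0].embedding
b = client.embeddings.create(model="nomic-embed-text-v1.5", input="desktop app for running models").data[0].embedding
print(cosine(a, b))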

Performance

Single-stream (M2 Max, MLX 8B Q4)    ~60 tok/s
Single-stream (M2 Max, GGUF 8B Q4)   ~50 tok/s
Concurrent requests                  Very low, single-user focused
Cold load                            10-60 s
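
These numbers vary a lot with hardware and quantization; a rough way to measure single-stream throughput on your own machine (a sketch, assuming the server reports token usage, which the OpenAI-compatible endpoint normally does):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Write a 200-word summary of the history of computing."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start

# Includes prompt-processing time, so this understates pure generation speed
print(f"{resp.usage.completion_tokens / elapsed:.1f} tok/s")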

Common pitfalls

  • Not open source: The app is closed-source. For embedding LM Studio into a product, shipping your own binary, or in-house distribution, prefer Ollama or llama.cpp. Free for personal use.
  • Single-user product: Server mode supports concurrent requests but isn't built for high throughput. For production, use vLLM.
  • Disk fills up fast: It's easy to download 10 models from the GUI, and each is 4-40 GB. Move the models folder (Settings → Models folder) to an external drive and delete unused models (see the sketch after this list).
  • Engine confusion on Apple: The same model can come as both GGUF and MLX. MLX is usually 20-30% faster, but not every model exists in MLX. Watch the badges.
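
For the disk pitfall, a quick way to see how much space downloaded models take (a sketch; the path is an assumption, check Settings → Models folder for the actual location):

from pathlib import Path

# Assumed default models folder; yours may differ
models_dir = Path.home() / ".lmstudio" / "models"

total_bytes = sum(f.stat().st_size for f in models_dir.rglob("*") if f.is_file())
print(f"{total_bytes / 1e9:.1f} GB in {models_dir}")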

Resources