
MLX

Apple Silicon-native framework

Overview

MLX is Apple's array framework, released in 2023 and designed from scratch for Apple Silicon. It offers a PyTorch/JAX-style Python API plus a Swift API, and it exploits unified memory: no copies between CPU and GPU, because kernels on both see the same memory.
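
What "no copies" means in practice: the same array can feed a CPU kernel and a GPU kernel without an explicit transfer step, simply by choosing the device per operation. A minimal sketch using the standard `stream=` override:

import mlx.core as mx

x = mx.random.normal((1024, 1024))

# Same buffer, two devices; no .to("cuda")-style copy step
y_gpu = mx.matmul(x, x, stream=mx.gpu)  # GPU kernel
y_cpu = mx.matmul(x, x, stream=mx.cpu)  # CPU kernel, same memory
mx.eval(y_gpu, y_cpu)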

Two layers: the low-level `mlx` package (numpy-like arrays) and `mlx-lm` (LLM inference and fine-tuning). It is the fastest path to running Llama, Mistral, Qwen, and the like on a Mac — typically 20-40% faster than llama.cpp.

Installation

# Python 3.9+, Apple Silicon only
pip install mlx mlx-lm

# Or the full ecosystem
pip install mlx mlx-lm mlx-vlm mlx-data

# Verify
python -c "import mlx.core as mx; print(mx.metal.is_available())"
# should print True
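
On recent MLX versions you can also check the default device directly; on Apple Silicon it should report the GPU:

python -c "import mlx.core as mx; print(mx.default_device())"
# expected: Device(gpu, 0)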

Configuration

Run a model (CLI)

# The mlx-community org on Hugging Face hosts 200+ models
# pre-converted to MLX

# Generate with a single command
mlx_lm.generate \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompt "Hi, introduce yourself" \
  --max-tokens 256 \
  --temp 0.3

# OpenAI-compatible server (port 8080)
mlx_lm.server \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --port 8080

Python API

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

response = generate(
    model, tokenizer,
    prompt="Hi!",
    max_tokens=256,
    temp=0.3,
    verbose=True,  # prints tokens/sec
)
print(response)
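
For token-by-token output, mlx-lm also ships a streaming generator. A sketch; the yielded chunk type has changed across mlx-lm releases, so treat `chunk.text` as version-dependent:

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# Yields chunks as they are generated instead of one final string
for chunk in stream_generate(model, tokenizer, prompt="Hi!", max_tokens=128):
    print(chunk.text, end="", flush=True)  # plain str on older releases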

LoRA fine-tune (training on Mac!)

# Prepare JSONL data (data/train.jsonl, data/valid.jsonl)

mlx_lm.lora \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --train \
  --data ./data \
  --batch-size 2 \
  --num-layers 16 \
  --iters 600 \
  --lora-rank 8

# Then run inference with the adapter
mlx_lm.generate \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --adapter-path ./adapters \
  --prompt "..."
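
A sketch of what the training rows can look like, one JSON object per line. The plain {"text": ...} schema is the simplest; chat-style {"messages": ...} records are also accepted, though the exact set of supported schemas varies by mlx-lm version:

{"text": "Question: What is MLX?\nAnswer: Apple's array framework for Apple Silicon."}
{"messages": [{"role": "user", "content": "What is MLX?"}, {"role": "assistant", "content": "Apple's array framework for Apple Silicon."}]}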

Hardware acceleration

MLX runs only on Apple Silicon — no Intel Mac, NVIDIA, or AMD support. But on that hardware it's unmatched: it drives every GPU core through Metal from a single API, and thanks to unified memory the KV cache never moves between CPU and GPU.

Apple Silicon (M1/M2/M3/M4): Native (Metal)
Intel Mac: Not supported
NVIDIA / AMD: Not supported
iOS / iPadOS: Yes (Swift API)
Unified memory: All RAM available for model + KV cache
Training: Yes (LoRA and QLoRA fine-tuning run on Mac)
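
To see how much memory the GPU side can actually address, recent versions expose Metal device info; the exact keys below are what current releases return and may change:

import mlx.core as mx

info = mx.metal.device_info()
print(info["architecture"])                              # e.g. an "applegpu_*" string
print(info["max_recommended_working_set_size"] / 2**30)  # GiB the GPU may wire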

Model formats & quantization

Format: MLX's own safetensors variant. The mlx-community org on Hugging Face hosts 200+ popular models pre-converted to MLX. Quantization: INT4, INT8, FP16.

Convert a model to MLX

# HF safetensors → MLX (FP16)
mlx_lm.convert \
  --hf-path meta-llama/Llama-3.1-8B-Instruct \
  --mlx-path ./Llama-3.1-8B-Instruct-mlx

# Convert with 4-bit quantization
mlx_lm.convert \
  --hf-path meta-llama/Llama-3.1-8B-Instruct \
  --mlx-path ./Llama-3.1-8B-Instruct-4bit \
  -q --q-bits 4 --q-group-size 64

# Upload to the Hub
mlx_lm.convert \
  --hf-path meta-llama/Llama-3.1-8B-Instruct \
  --upload-repo your-username/Llama-3.1-8B-mlx
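
The same conversion is callable from Python; the parameters mirror the CLI flags (a sketch, minor signature differences between versions are possible):

from mlx_lm import convert

# HF safetensors in, 4-bit MLX weights out
convert(
    hf_path="meta-llama/Llama-3.1-8B-Instruct",
    mlx_path="./Llama-3.1-8B-Instruct-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
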
FP16: Lossless, best quality
INT8: Good balance of quality and size
INT4 (group size 64): Equivalent to 'Q4', most common
INT4 (group size 32): Slightly higher quality, slightly larger
GGUF: Not supported (re-convert with mlx_lm.convert)

API

OpenAI-compatible (Python)

# after starting mlx_lm.server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

resp = client.chat.completions.create(
    model="mlx-community/Llama-3.1-8B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hi"}],
    stream=True,
)
for c in resp:
    print(c.choices[0].delta.content or "", end="", flush=True)

Low-level MLX (numpy-like)

import mlx.core as mx

a = mx.array([1, 2, 3, 4])
b = mx.array([10, 20, 30, 40])

c = a * b + a.mean()
mx.eval(c)              # lazy; this forces the computation
print(c)                # array([12.5, 42.5, 92.5, 162.5], dtype=float32)

# Set the default device manually:
mx.set_default_device(mx.gpu)
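
The training support mentioned above rests on JAX-style composable function transforms in the core API. A minimal gradient sketch:

import mlx.core as mx

def loss_fn(w, x, y):
    # mean squared error of a linear model
    return mx.mean((x @ w - y) ** 2)

grad_fn = mx.grad(loss_fn)   # d(loss)/dw, w.r.t. the first argument

w = mx.zeros((3,))
x = mx.random.normal((8, 3))
y = mx.random.normal((8,))

mx.eval(grad_fn(w, x, y))    # lazy here too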

Performance

M2 Max, Llama-3.1-8B 4-bit: ~75 tok/s
M3 Max, Llama-3.1-8B 4-bit: ~95 tok/s
M3 Ultra, Llama-3.1-70B 4-bit: ~18 tok/s
vs llama.cpp (same hardware): 20-40% faster
Training (LoRA, 8B 4-bit): ~4 h for 600 iters on M2 Max
RAM ceiling: ~75% of unified memory (leave headroom for the system)
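
A rough way to sanity-check whether a model fits, assuming weights dominate; rule-of-thumb arithmetic, not a measurement:

params = 8e9                                # Llama-3.1-8B
bits = 4                                    # INT4 weights
weights_gb = params * bits / 8 / 1e9        # 4.0 GB of raw weights
overhead = 1.25                             # quant scales + KV cache + buffers (rough)
print(f"~{weights_gb * overhead:.0f} GB")   # ~5 GB, matching the 8B Q4 figure below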

Common pitfalls

  • Apple Silicon only: No Intel Mac, no NVIDIA, no Linux. If you need cross-platform, use llama.cpp or vLLM.
  • Doesn't load GGUF: MLX uses its own safetensors format. GGUF files from llama.cpp/Ollama won't load; re-convert with mlx_lm.convert.
  • Unified memory ceiling: A 16 GB Mac runs an 8B Q4 model (~5 GB) fine, but 70B won't fit. macOS gives the GPU ~66% of RAM by default; raise it with sudo sysctl iogpu.wired_limit_mb=N.
  • Lazy evaluation confusion: MLX is lazy — without mx.eval() or printing, nothing actually runs. Surprising on day one for PyTorch refugees (see the sketch below).
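
A minimal demonstration of the lazy model:

import mlx.core as mx

x = mx.random.normal((4096, 4096))
y = x @ x          # builds a graph node; no matmul has run yet
mx.eval(y)         # the computation actually happens here
print(y[0, 0])     # printing or converting also forces evaluation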

Resources