MLX
Apple Silicon-native framework
Overview
MLX is Apple's array framework, released in 2023 and designed from scratch for Apple Silicon. It offers a PyTorch/JAX-style Python API plus a Swift API, and it exploits unified memory: nothing is copied between CPU and GPU, because kernels on both devices address the same memory.
Two layers: the low-level `mlx` package (NumPy-like arrays) and `mlx-lm` (LLM inference and fine-tuning). It is the fastest path to running Llama, Mistral, Qwen, and similar models on a Mac, typically 20-40% faster than llama.cpp.
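A minimal sketch of what unified memory buys you, using the per-operation stream argument from the MLX docs: the same arrays feed CPU and GPU kernels with no transfers in either direction.

import mlx.core as mx

a = mx.random.normal((4096,))
b = mx.random.normal((4096,))
# Same buffers, two devices: no copy is made either way
c_cpu = mx.add(a, b, stream=mx.cpu)
c_gpu = mx.add(a, b, stream=mx.gpu)
mx.eval(c_cpu, c_gpu)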
Installation
# Python 3.9+, Apple Silicon only
pip install mlx mlx-lm
# Or the full ecosystem
pip install mlx mlx-lm mlx-vlm mlx-data
# Verify the install
python -c "import mlx.core as mx; print(mx.metal.is_available())"
# Should print True

Configuration
# The mlx-community org on Hugging Face hosts
# 200+ models pre-converted to MLX
# Generate with a single command
mlx_lm.generate \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--prompt "Selam, kendini tanıt" \
--max-tokens 256 \
--temp 0.3
# OpenAI-compatible server (port 8080)
mlx_lm.server \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--port 8080

from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
response = generate(
    model, tokenizer,
    prompt="Hello!",
    max_tokens=256,
    temp=0.3,
    verbose=True,  # prints tokens/sec
)
print(response)

# Prepare JSONL data (data/train.jsonl, data/valid.jsonl; see the format sketch below)
mlx_lm.lora \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--train \
--data ./data \
--batch-size 2 \
--num-layers 16 \
--iters 600 \
--lora-rank 8
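What goes in those files: one JSON object per line. Recent mlx-lm releases accept a few schemas (plain `text`, `prompt`/`completion` pairs, or chat-style `messages`); a given file should stick to one of them. A minimal sketch of a data/train.jsonl line in the first and last forms:

{"text": "Q: What is MLX? A: Apple's array framework for Apple Silicon."}
{"messages": [{"role": "user", "content": "What is MLX?"}, {"role": "assistant", "content": "Apple's array framework for Apple Silicon."}]}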
# Then run inference with the adapter
mlx_lm.generate \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--adapter-path ./adapters \
--prompt "..."Hardware acceleration
MLX runs only on Apple Silicon; there is no Intel Mac, NVIDIA, or AMD support. On that hardware, though, it is unmatched: it drives every Apple Silicon GPU core through Metal with a single API, and thanks to unified memory the KV cache never moves between CPU and GPU.
| Apple Silicon (M1/M2/M3/M4) | Native (Metal) |
| Intel Mac | — |
| NVIDIA / AMD | — |
| iOS / iPadOS | Yes (Swift API) |
| Unified memory | Use all RAM for model + KV cache |
| Training | Yes — LoRA, QLoRA fine-tune runs on Mac |
Model formats & quantization
Format: MLX's own safetensors-based layout. The mlx-community org on Hugging Face hosts 200+ popular models pre-converted to MLX. Quantization: 4-bit and 8-bit, with FP16 as the unquantized baseline.
# HF safetensors → MLX (FP16)
mlx_lm.convert \
--hf-path meta-llama/Llama-3.1-8B-Instruct \
--mlx-path ./Llama-3.1-8B-Instruct-mlx
# Convert and quantize to 4-bit
mlx_lm.convert \
--hf-path meta-llama/Llama-3.1-8B-Instruct \
--mlx-path ./Llama-3.1-8B-Instruct-4bit \
-q --q-bits 4 --q-group-size 64
# Upload to the Hub
mlx_lm.convert \
--hf-path meta-llama/Llama-3.1-8B-Instruct \
--upload-repo your-username/Llama-3.1-8B-mlx

| FP16 | Unquantized baseline; best quality |
| INT8 | Good balance of quality/size |
| INT4 (group size 64) | Equivalent to 'Q4', most common |
| INT4 (group size 32) | Slightly higher quality, slightly larger |
| GGUF | Not supported; re-convert with mlx_lm.convert |
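Conversion is also scriptable from Python. A minimal sketch, assuming mlx_lm exposes a convert helper whose keyword arguments mirror the CLI flags above (names vary slightly across versions):

from mlx_lm import convert

# Convert + 4-bit quantize, mirroring the CLI example above
convert(
    "meta-llama/Llama-3.1-8B-Instruct",
    mlx_path="./Llama-3.1-8B-Instruct-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)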
API
# After starting mlx_lm.server
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")
resp = client.chat.completions.create(
    model="mlx-community/Llama-3.1-8B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for c in resp:
    print(c.choices[0].delta.content or "", end="", flush=True)
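The Python API can stream too, without going through the HTTP server. A minimal sketch, assuming a recent mlx-lm in which stream_generate yields chunk objects with a .text attribute:

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
for chunk in stream_generate(model, tokenizer, "Hello!", max_tokens=128):
    print(chunk.text, end="", flush=True)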
import mlx.core as mx

a = mx.array([1, 2, 3, 4])
b = mx.array([10, 20, 30, 40])
c = a * b + a.mean()
mx.eval(c)  # lazy by default; this forces execution
print(c)  # array([12.5, 42.5, 92.5, 162.5], dtype=float32)
# Set the device manually:
mx.set_default_device(mx.gpu)

Performance
| M2 Max, Llama-3.1-8B 4bit | ~75 tok/s |
| M3 Max, Llama-3.1-8B 4bit | ~95 tok/s |
| M3 Ultra, Llama-3.1-70B 4bit | ~18 tok/s |
| vs llama.cpp (same hardware) | 20-40% faster |
| Training (LoRA, 8B 4bit) | ~4h for 600 iters on M2 Max |
| RAM ceiling | ~75% of unified memory (leave headroom for system) |
Common pitfalls
- Apple Silicon only: no Intel Mac, no NVIDIA, no Linux. If you need cross-platform, use llama.cpp or vLLM.
- Doesn't load GGUF: MLX uses its own safetensors-based format. GGUF files from llama.cpp/Ollama won't load; re-convert with mlx_lm.convert.
- Unified memory ceiling: a 16 GB Mac runs 8B Q4 (~5 GB) fine, but 70B won't fit. macOS gives the GPU about two-thirds of RAM by default; raise it with sudo sysctl iogpu.wired_limit_mb=N (e.g. N=49152 allows ~48 GB on a 64 GB machine).
- Lazy evaluation confusion: MLX is lazy, so without mx.eval() or a print nothing actually runs. Surprising on day one for PyTorch refugees; see the timing sketch below.
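The classic way that last pitfall bites is benchmarking. A minimal sketch of the failure mode:

import time
import mlx.core as mx

x = mx.random.normal((4096, 4096))

t0 = time.time()
y = x @ x  # only builds the compute graph; no work happens yet
print(f"lazy: {time.time() - t0:.4f}s")  # misleadingly close to zero
mx.eval(y)  # forces the matmul to actually run
print(f"eval: {time.time() - t0:.4f}s")  # the real cost shows up here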