LoRA
Low-Rank Adaptation
A parameter-efficient fine-tuning method that freezes the big model and trains only small adapter matrices alongside it.
Fine-tuning a 7B-parameter LLM normally means updating every weight: multiple high-end GPUs, hundreds of gigabytes of memory for weights and optimizer state, and days of compute. LoRA (Hu et al., Microsoft, 2021) cuts that cost roughly 100×.
The idea: freeze the base model (its weights don't change). For each adapted weight matrix W (n × m), add two small matrices: A (n × r) and B (r × m), with the rank r small (usually 8–64). Their product A·B is the low-rank delta added to W, so the layer computes x(W + AB). Only A and B are trained.
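The frozen-base-plus-trainable-delta structure can be sketched in a few lines of NumPy. The dimensions and rank below are illustrative, not from the original; B starts at zero so the adapter is a no-op until training begins.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, r = 512, 512, 16               # layer dims and LoRA rank (r << n, m)

W = rng.normal(size=(n, m))          # frozen base weight: never updated
A = rng.normal(size=(n, r)) * 0.01   # trainable, small random init
B = np.zeros((r, m))                 # trainable, zero init: delta starts at 0

def forward(x):
    # Base path plus low-rank delta; only A and B would receive gradients.
    return x @ W + x @ A @ B

x = rng.normal(size=(4, n))
# At initialization B is zero, so the adapted layer matches the base exactly:
assert np.allclose(forward(x), x @ W)

print(A.size + B.size, "vs", W.size)  # 16384 trainable vs 262144 frozen
```

The zero init for B matters: it guarantees the model starts out identical to the base, and fine-tuning only gradually moves it away.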
Result: on a 7B model, only ~5M parameters get trained (~0.07%). One consumer GPU (24GB VRAM) is enough. The output adapter is ~50MB and snaps onto the base model at runtime.
Instead of rewriting a book, you stick post-it notes in the margins. The book (base model) stays untouched. The post-its (LoRA adapter) add commentary. Make different post-it sets for different scenarios and swap them in and out of the same book.
You want to specialize Llama-3-8B for Turkish legal documents. Classic fine-tuning: 8× A100, $5000+, 3 days. With LoRA: one RTX 4090, $20 of electricity, 4 hours. Resulting adapter: 80MB.
Next day a different customer needs medical terminology. Same base model, new LoRA. In production you swap between base + lora_legal and base + lora_medical on the fly. Storing many variants is cheap.
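Adapter swapping amounts to keeping one frozen weight and choosing which low-rank delta to add. A minimal NumPy sketch (the adapter names and sizes are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 64, 64, 8

W_base = rng.normal(size=(n, m))     # shared frozen base weight

# Hypothetical adapter registry: one trained (A, B) pair per specialization.
adapters = {
    "legal":   (rng.normal(size=(n, r)), rng.normal(size=(r, m))),
    "medical": (rng.normal(size=(n, r)), rng.normal(size=(r, m))),
}

def effective_weight(name):
    # Merge on demand: the base stays untouched, the delta comes from
    # whichever adapter is selected.
    A, B = adapters[name]
    return W_base + A @ B

W_legal = effective_weight("legal")
W_medical = effective_weight("medical")
assert not np.allclose(W_legal, W_medical)  # same base, different deltas
```

In practice libraries such as Hugging Face PEFT handle this bookkeeping, but the mechanism is the same: one base, many small (A, B) pairs.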
Good fit:
- Fine-tuning open-source models on a limited GPU budget
- Multiple specialized variants of the same base model
- Fast iteration: trying 5 different adapters in a day
- Edge deployment: small adapter files are easy to distribute
Poor fit:
- Changing fundamental model capabilities (e.g. adding a new language), where a full fine-tune is required
- Closed-source models (GPT, Claude): LoRA needs direct access to the weights
- Very small models, where full fine-tuning is already cheap and the adapter overhead isn't worth the savings
Misjudging rank
r = 4 trains fast but underfits; r = 256 trains slowly and overfits. The typical sweet spot is r = 8–32; tune to task complexity.
Naively merging multiple LoRAs
Simply summing two LoRA deltas (legal + medical) makes the adapters interfere with each other. Merge techniques like DARE-TIES exist, but the area is still active research.
Base model coupling
A LoRA adapter only works on the exact base it was trained on. Going Llama-3.1 → 3.2 means re-training.