Fine-tuning
Specializing a pretrained model
Re-training a pretrained model on your own data, at small scale, to specialize it for a task or domain.
Training a model from scratch costs millions. Instead, take an existing base model (GPT, Llama, Mistral, etc.) and continue training it on a few thousand examples of your own. Result: a model tuned for your domain, more accurate and more consistent than the generic one.
Two main approaches: full fine-tuning (all weights updated: expensive but powerful) and PEFT (parameter-efficient fine-tuning: LoRA, adapters, prefix tuning), where only a small fraction of the parameters is trained, making it cheap yet still effective.
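The PEFT idea is easiest to see in LoRA form: freeze the full weight matrix and learn only two small low-rank factors. A minimal numpy sketch (dimensions, rank, and alpha are illustrative assumptions, and no actual training happens here):

```python
import numpy as np

# LoRA in one picture: instead of updating the full d x d matrix W,
# train A (r x d) and B (d x r) with rank r << d, and use the adapted
# weights W + (alpha / r) * B @ A at inference time.
d, r, alpha = 512, 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained weights
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-initialized, so the delta starts at 0

delta = (alpha / r) * (B @ A)
W_adapted = W + delta

full_params = W.size
lora_params = A.size + B.size
print(f"trainable: {lora_params} vs full: {full_params} "
      f"({lora_params / full_params:.1%})")
```

With d=512 and r=8 only about 3% of the parameters are trainable; at LLM scale (d in the thousands, applied per layer) the savings are what makes fine-tuning feasible on a single GPU.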
Don't confuse with RAG: RAG adds knowledge (fetches docs at runtime), fine-tuning shapes behavior (learns tone, format, terminology). They're complementary, not alternatives.
Like hiring a college graduate and giving them two weeks of company onboarding. They already have core skills (base model); you're just teaching them "how we do things here." Training from scratch = raising a baby for 25 years. Fine-tuning is more practical.
An email SaaS classifies customer emails into "invoice, order, support, spam." First tried with GPT-4 prompting — 88% accuracy, but 1M emails/day × $0.02/1K tokens = $20K/month.
Fine-tune route: collect 5K labeled examples, fine-tune GPT-3.5. Result: 94% accuracy (better!), cost $1K/month (20× cheaper). Smaller model + shorter prompt + consistent output.
Investment: 2 weeks + $200 in fine-tuning fees. ROI: paid for itself in the first month.
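The cost comparison above can be made explicit as a rough formula. Note that the volume and per-1K-token rate come from the text, but the tokens-per-email value below is reverse-engineered from the quoted $20K/month figure and is purely illustrative:

```python
# Rough monthly-cost model for the scenario above.
EMAILS_PER_DAY = 1_000_000
DAYS_PER_MONTH = 30

def monthly_cost(tokens_per_email: float, price_per_1k_tokens: float) -> float:
    """Total monthly spend for classifying every email with an LLM."""
    tokens_per_month = EMAILS_PER_DAY * DAYS_PER_MONTH * tokens_per_email
    return tokens_per_month / 1000 * price_per_1k_tokens

# ~33 tokens/email at $0.02/1K reproduces the ~$20K/month in the text;
# a fine-tuned model cuts both the per-token rate and the prompt length,
# which is where the ~20x saving comes from.
print(f"${monthly_cost(33, 0.02):,.0f}/month")
```

The point of the formula: the saving multiplies, because a fine-tuned model needs both a cheaper per-token rate and a much shorter prompt (no long instructions or few-shot examples).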
{"messages":[{"role":"system","content":"You are an email classifier."},{"role":"user","content":"Your invoice is ready, please pay."},{"role":"assistant","content":"invoice"}]}
{"messages":[{"role":"system","content":"You are an email classifier."},{"role":"user","content":"Your order has shipped."},{"role":"assistant","content":"order"}]}
{"messages":[{"role":"system","content":"You are an email classifier."},{"role":"user","content":"One click to make a million!"},{"role":"assistant","content":"spam"}]}from openai import OpenAI
client = OpenAI()
# 1. Upload the training file
training_file = client.files.create(
file=open("emails.jsonl", "rb"),
purpose="fine-tune",
)
# 2. Start the fine-tune job
job = client.fine_tuning.jobs.create(
training_file=training_file.id,
model="gpt-4o-mini-2024-07-18",
hyperparameters={"n_epochs": 3},
)
# 3. Poll status (~10-30 min total)
print(client.fine_tuning.jobs.retrieve(job.id).status)
# 4. When done, model = "ft:gpt-4o-mini:org:emails:abc123"
# Use this model name in your normal API calls.

When to fine-tune:
- Capturing a specific tone/style (brand voice)
- Domain-specific terminology (medical, legal, financial)
- Guaranteed structured output (always the same format)
- Cost optimization: a small fine-tuned model can beat a big generic one
- Latency: smaller model is dramatically faster
When not to fine-tune:
- To add knowledge (use RAG instead)
- When you need fast iteration: a prompt change is instant, while fine-tuning takes days
- With too little data (< 500 examples): overfitting is inevitable
- When you want to surpass the base model: fine-tuning's ceiling is the base model
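Before starting a job, it's worth sanity-checking the JSONL training file against the chat format shown earlier. A minimal sketch (the 500-example floor mirrors this text; the specific checks are assumptions about what "valid" means here):

```python
import json

def validate_jsonl(path: str, min_examples: int = 500) -> list[str]:
    """Check a chat-format fine-tuning file; return a list of problems."""
    problems = []
    n = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue
            n += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {i}: not valid JSON")
                continue
            roles = [m.get("role") for m in record.get("messages", [])]
            # Each example must end with the answer the model should learn.
            if roles[-1:] != ["assistant"]:
                problems.append(f"line {i}: last message must be from the assistant")
    if n < min_examples:
        problems.append(f"only {n} examples; below the ~{min_examples} practical minimum")
    return problems
```

Running this before the upload step catches the two failure modes described below: files that are too small and records the training API would reject or learn nothing from.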
Insufficient examples
Fine-tuning with 50 examples yields either overfitting or no effect. The practical minimum is 500–1,000 examples; the sweet spot is 5K–50K. Synthetic data (using an LLM to generate training examples) is now standard practice.
Catastrophic forgetting
Aggressive fine-tuning can wipe out general capabilities: teaching a model medical terminology can degrade its general English fluency. Mitigations: a low learning rate, few epochs, and a mixed dataset.
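The "mixed dataset" mitigation can be sketched as interleaving general-purpose examples into the domain data so the model keeps seeing ordinary text during training. The 20% general-data fraction below is an illustrative assumption, not a figure from this text:

```python
import random

def mix_datasets(domain, general, general_fraction=0.2, seed=0):
    """Blend general-purpose examples into domain data to reduce forgetting."""
    # How many general examples give the requested fraction of the final mix.
    n_general = int(len(domain) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed
```

For example, 80 domain examples mixed at 20% pull in 20 general examples, so one training example in five reinforces capabilities the base model already had.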
No versioning
You try five versions of a fine-tuned model. Which one fixed what, and which broke what? Without records, you can't tell. An evaluation suite plus version tracking is mandatory.
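A minimal sketch of that evaluation-plus-tracking loop (function names and the JSONL log format are assumptions; `classify` would wrap your actual model call, e.g. to a "ft:..." model name):

```python
import json
import time

def evaluate(classify, test_set):
    """Accuracy of a classifier function over (text, expected_label) pairs."""
    correct = sum(classify(text) == label for text, label in test_set)
    return correct / len(test_set)

def record_run(model_name, accuracy, log_path="eval_log.jsonl"):
    """Append one evaluation result so model versions stay comparable."""
    entry = {"model": model_name, "accuracy": accuracy, "ts": time.time()}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Run the same fixed test set against every fine-tuned version before switching traffic to it; the append-only log is what lets you answer "which version broke this?" months later.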