Fine-tuning
Specializing a pretrained model
Re-training a pretrained model on your own data, at small scale, to specialize it for a task or domain.
Training a model from scratch costs millions. Instead, take an existing base model (GPT, Llama, Mistral, etc.) and continue training it on a few thousand examples of your own. Result: a model tuned for your domain, more accurate and more consistent than the generic one.
Two main approaches: full fine-tuning (all weights updated: expensive but powerful) and PEFT (parameter-efficient fine-tuning: LoRA, adapters, prefix tuning), where only a small fraction of the parameters is trained, making it cheap yet still effective.
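The PEFT idea is easiest to see in LoRA form: freeze the full weight matrix and learn only two small low-rank factors. A minimal numpy sketch (dimensions, rank, and alpha are illustrative assumptions, and no actual training happens here):

```python
import numpy as np

# LoRA in one picture: instead of updating the full d x d matrix W,
# train A (r x d) and B (d x r) with rank r << d, and use the adapted
# weights W + (alpha / r) * B @ A at inference time.
d, r, alpha = 512, 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained weights
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-initialized, so the delta starts at 0

delta = (alpha / r) * (B @ A)
W_adapted = W + delta

full_params = W.size
lora_params = A.size + B.size
print(f"trainable: {lora_params} vs full: {full_params} "
      f"({lora_params / full_params:.1%})")
```

With d=512 and r=8 only about 3% of the parameters are trainable; at LLM scale (d in the thousands, applied per layer) the savings are what makes fine-tuning feasible on a single GPU.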
Don't confuse with RAG: RAG adds knowledge (fetches docs at runtime), fine-tuning shapes behavior (learns tone, format, terminology). They're complementary, not alternatives.
Like hiring a college graduate and giving them two weeks of company onboarding. They already have core skills (base model); you're just teaching them "how we do things here." Training from scratch = raising a baby for 25 years. Fine-tuning is more practical.
An email SaaS classifies customer emails into "invoice, order, support, spam." First tried with GPT-4 prompting — 88% accuracy, but 1M emails/day × $0.02/1K tokens = $20K/month.
Fine-tune route: collect 5K labeled examples, fine-tune GPT-3.5. Result: 94% accuracy (better!), cost $1K/month (20× cheaper). Smaller model + shorter prompt + consistent output.
Investment: 2 weeks + $200 in fine-tuning fees. ROI: paid for itself in the first month.
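The cost comparison above can be made explicit as a rough formula. Note that the volume and per-1K-token rate come from the text, but the tokens-per-email value below is reverse-engineered from the quoted $20K/month figure and is purely illustrative:

```python
# Rough monthly-cost model for the scenario above.
EMAILS_PER_DAY = 1_000_000
DAYS_PER_MONTH = 30

def monthly_cost(tokens_per_email: float, price_per_1k_tokens: float) -> float:
    """Total monthly spend for classifying every email with an LLM."""
    tokens_per_month = EMAILS_PER_DAY * DAYS_PER_MONTH * tokens_per_email
    return tokens_per_month / 1000 * price_per_1k_tokens

# ~33 tokens/email at $0.02/1K reproduces the ~$20K/month in the text;
# a fine-tuned model cuts both the per-token rate and the prompt length,
# which is where the ~20x saving comes from.
print(f"${monthly_cost(33, 0.02):,.0f}/month")
```

The point of the formula: the saving multiplies, because a fine-tuned model needs both a cheaper per-token rate and a much shorter prompt (no long instructions or few-shot examples).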
{"messages":[{"role":"system","content":"You are an email classifier."},{"role":"user","content":"Your invoice is ready, please pay."},{"role":"assistant","content":"invoice"}]}
{"messages":[{"role":"system","content":"You are an email classifier."},{"role":"user","content":"Your order has shipped."},{"role":"assistant","content":"order"}]}
{"messages":[{"role":"system","content":"You are an email classifier."},{"role":"user","content":"One click to make a million!"},{"role":"assistant","content":"spam"}]}from openai import OpenAI
client = OpenAI()
# 1. Upload the training file
training_file = client.files.create(
file=open("emails.jsonl", "rb"),
purpose="fine-tune",
)
# 2. Start the fine-tune job
job = client.fine_tuning.jobs.create(
training_file=training_file.id,
model="gpt-4o-mini-2024-07-18",
hyperparameters={"n_epochs": 3},
)
# 3. Poll status (~10-30 min total)
print(client.fine_tuning.jobs.retrieve(job.id).status)
# 4. When done, model = "ft:gpt-4o-mini:org:emails:abc123"
# Use this model name in your normal API calls.

When to fine-tune:
- Capturing a specific tone/style (brand voice)
- Domain-specific terminology (medical, legal, financial)
- Guaranteed structured output (always the same format)
- Cost optimization: a small fine-tuned model can beat a big generic one
- Latency: smaller model is dramatically faster
When not to fine-tune:
- To add knowledge (use RAG instead)
- When you need fast iteration: a prompt change is instant, while fine-tuning takes days
- With too little data (< 500 examples): overfitting is inevitable
- When you want to surpass the base model: fine-tuning's ceiling is the base model
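Before starting a job, it's worth sanity-checking the JSONL training file against the chat format shown earlier. A minimal sketch (the 500-example floor mirrors this text; the specific checks are assumptions about what "valid" means here):

```python
import json

def validate_jsonl(path: str, min_examples: int = 500) -> list[str]:
    """Check a chat-format fine-tuning file; return a list of problems."""
    problems = []
    n = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue
            n += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {i}: not valid JSON")
                continue
            roles = [m.get("role") for m in record.get("messages", [])]
            # Each example must end with the answer the model should learn.
            if roles[-1:] != ["assistant"]:
                problems.append(f"line {i}: last message must be from the assistant")
    if n < min_examples:
        problems.append(f"only {n} examples; below the ~{min_examples} practical minimum")
    return problems
```

Running this before the upload step catches the two failure modes described below: files that are too small and records the training API would reject or learn nothing from.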
Insufficient examples
Fine-tuning with 50 examples yields either overfitting or no effect. The practical minimum is 500–1,000 examples; the sweet spot is 5K–50K. Synthetic data (using an LLM to generate training examples) is now standard practice.
Catastrophic forgetting
Aggressive fine-tuning can wipe out general capabilities: teaching a model medical terminology can degrade its general English fluency. Mitigations: a low learning rate, few epochs, and a mixed dataset.
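The "mixed dataset" mitigation can be sketched as interleaving general-purpose examples into the domain data so the model keeps seeing ordinary text during training. The 20% general-data fraction below is an illustrative assumption, not a figure from this text:

```python
import random

def mix_datasets(domain, general, general_fraction=0.2, seed=0):
    """Blend general-purpose examples into domain data to reduce forgetting."""
    # How many general examples give the requested fraction of the final mix.
    n_general = int(len(domain) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed
```

For example, 80 domain examples mixed at 20% pull in 20 general examples, so one training example in five reinforces capabilities the base model already had.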
No versioning
You try five versions of a fine-tuned model. Which one fixed what, and which broke what? Without records, you can't tell. An evaluation suite plus version tracking is mandatory.
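A minimal sketch of that evaluation-plus-tracking loop (function names and the JSONL log format are assumptions; `classify` would wrap your actual model call, e.g. to a "ft:..." model name):

```python
import json
import time

def evaluate(classify, test_set):
    """Accuracy of a classifier function over (text, expected_label) pairs."""
    correct = sum(classify(text) == label for text, label in test_set)
    return correct / len(test_set)

def record_run(model_name, accuracy, log_path="eval_log.jsonl"):
    """Append one evaluation result so model versions stay comparable."""
    entry = {"model": model_name, "accuracy": accuracy, "ts": time.time()}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Run the same fixed test set against every fine-tuned version before switching traffic to it; the append-only log is what lets you answer "which version broke this?" months later.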