When to fine-tune, and how: SFT, LoRA, QLoRA, DPO, and RLHF without the hype.
Fine-tuning is powerful but expensive. In most cases, the right order of operations is: improve the prompt first, then add few-shot examples, then add retrieval (RAG), and only reach for fine-tuning once those stop moving the needle.
Good signals that fine-tuning will help: you have thousands of high-quality examples, your prompt is getting long and expensive, a smaller model would be cheaper if it could match quality, or you need a very specific output format the base model keeps missing.
Supervised fine-tuning (SFT) is the most common form. You provide input/output pairs (e.g., a user question and an ideal answer) and train the model to reproduce the outputs. It works best when your dataset is consistent, well-labeled, and reflects the exact behavior you want.
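The data-preparation step is where SFT pipelines most often go wrong. A minimal sketch (field names and the `-100` ignore index follow a common convention, but everything here is illustrative): loss is computed only on the completion tokens, so the model learns to produce the answer rather than echo the prompt.

```python
# One SFT training example: an input/output pair.
# (Hypothetical field names; adapt to your training framework.)
example = {
    "prompt": "Extract the total as JSON.\nInvoice: widgets, total due $41.20\n",
    "completion": '{"total": 41.20}',
}

IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_labels(prompt_ids, completion_ids):
    """Mask the prompt so the loss covers only the completion tokens."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)

# Toy token ids; a real pipeline would get these from a tokenizer.
labels = build_labels(prompt_ids=[11, 42, 7], completion_ids=[93, 15])
assert labels == [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 93, 15]
```

Whether to mask the prompt is a real design choice: masking focuses all gradient signal on the behavior you want, at the cost of ignoring any signal in the inputs.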
Instead of updating all of the model's weights, LoRA (Low-Rank Adaptation) freezes the base model and trains small low-rank "adapter" matrices inserted alongside selected weight matrices (typically the attention projections). Training is dramatically cheaper, and you can swap adapters in and out at inference time.
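The core mechanic fits in a few lines of NumPy (dimensions and names are illustrative, not any library's API): the frozen weight `W` gets a trainable low-rank update `B @ A`, and because `B` is initialized to zero, training starts from exactly the base model's behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 16, 16, 4          # r << d is the low-rank bottleneck
W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))            # trainable, zero init: update starts at 0
alpha = 8.0                         # scaling hyperparameter

def lora_forward(x):
    # Base path (frozen) plus the scaled low-rank update path (trainable).
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B at zero, the adapted layer matches the frozen base layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Only `A` and `B` get gradient updates: here that is r·(d_in + d_out) = 128 trainable values against 256 frozen ones, and the gap widens fast at real model sizes.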
QLoRA goes further: it loads the base model in 4-bit quantized form and trains LoRA adapters on top. This makes it realistic to fine-tune 7B–70B models on a single consumer or prosumer GPU.
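QLoRA's actual storage format is a 4-bit "NormalFloat" (NF4) type with double quantization; the simpler blockwise absmax int4 scheme below is a sketch that only conveys the mechanic of 4-bit storage, not the real implementation.

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    """Blockwise symmetric absmax quantization to levels -7..7 (fits in 4 bits)."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # one scale per block
    q = np.round(blocks / scale).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Recover an approximation of the original weights for the forward pass."""
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, w.shape)
# Rounding error is bounded by half a quantization step within each block.
assert np.abs(w - w_hat).max() <= scale.max() / 2 + 1e-6
```

During QLoRA training, the quantized base weights stay frozen; the forward pass dequantizes them on the fly, and gradients flow only into the LoRA adapters.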
Once a model follows instructions, you often want to shape which answers it prefers. You collect pairs of responses (A vs. B) labeled with a human preference, then train the model to favor the preferred one. DPO (Direct Preference Optimization) trains on these pairs directly; RLHF first fits a reward model to the preferences and then optimizes the policy against it with reinforcement learning.
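The DPO objective is compact enough to write out per preference pair (the log-probabilities below are toy values; `beta` is the usual temperature hyperparameter): the loss pushes the policy's chosen-vs-rejected log-ratio margin above that of a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), comparing policy and reference log-ratios."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Before any training the policy equals the reference: margin 0, loss log(2).
assert abs(dpo_loss(-5.0, -6.0, -5.0, -6.0) - math.log(2)) < 1e-9
# If the policy favors the chosen response more than the reference did, loss drops.
assert dpo_loss(-4.0, -7.0, -5.0, -6.0) < math.log(2)
```

Anchoring to the reference model is what keeps DPO from collapsing into degenerate outputs: the policy is rewarded only for shifting preference beyond the reference, not for maximizing any response's raw probability.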
Continued pretraining is the heavyweight option: if you need the model to absorb a lot of new domain text (legal, medical, internal docs), you can continue the original next-token objective on that corpus before SFT. This is far more compute-hungry than SFT and only worth it at real scale.
| Goal | Start with | Why |
|---|---|---|
| Answer questions about your docs | RAG | Keeps knowledge fresh and auditable without retraining. |
| Consistent output format / style | SFT (often LoRA) | Format adherence is exactly what SFT is good at. |
| Cheaper inference for a narrow task | SFT on a smaller model | Distill a large-model workflow into a smaller specialist. |
| "Make it prefer answers like this" | DPO | Directly optimizes preferences without a reward model. |
| Adopt a whole new domain language | Continued pretraining + SFT | Teach vocabulary and patterns before task-specific tuning. |
Rule of thumb: if you can't describe the target behavior in a prompt well enough for a human to grade outputs, you're not ready to fine-tune yet.