Fine-Tuning

When to fine-tune, and how - SFT, LoRA, QLoRA, DPO, and RLHF without the hype.

Do you actually need to fine-tune?

Fine-tuning is powerful but expensive. In most cases, the right order of operations is:

  1. Prompting - clear instructions, examples, structured output.
  2. Retrieval (RAG) - if the model needs facts it doesn't know.
  3. Tools / function calling - if it needs to take actions or do math.
  4. Fine-tuning - if you still need more: style, format, latency, or task-specific quality.

Good signals that fine-tuning will help: you have thousands of high-quality examples, your prompt is getting long and expensive, a smaller model would be cheaper if it could match quality, or you need a very specific output format the base model keeps missing.

The main techniques

Supervised fine-tuning (SFT)

The most common form. You provide input/output pairs (e.g., a user question and an ideal answer) and train the model to reproduce the outputs. Works best when your dataset is consistent, well-labeled, and reflects the exact behavior you want.
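Concretely, SFT data is usually chat-style JSONL, one record per line; the `messages` schema below follows a common convention, but field names vary by training framework, so check your tooling. A minimal sketch:

```python
import json

# One SFT training example: a prompt and the exact output we want the
# model to reproduce. The "messages" schema is a common chat format;
# field names vary by training framework.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Answer in exactly one sentence."},
            {"role": "user", "content": "What does HTTP 404 mean?"},
            {"role": "assistant", "content": "The server could not find the requested resource."},
        ]
    },
]

def to_jsonl(rows):
    """Serialize examples to JSONL, one training record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)

def validate(rows):
    """Cheap consistency check: every record must end with an assistant turn."""
    for r in rows:
        assert r["messages"][-1]["role"] == "assistant", "missing target output"
    return True

validate(examples)
print(to_jsonl(examples))
```

Consistency checks like `validate` are worth running on the whole dataset before training: a handful of records with a missing or malformed target can quietly degrade the result.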

Parameter-efficient fine-tuning: LoRA & QLoRA

Instead of updating all of the model's weights, LoRA (Low-Rank Adaptation) freezes the base model and trains small "adapter" matrices inserted alongside selected weight matrices (typically the attention projections). Training is dramatically cheaper, the adapters are tiny to store, and you can swap them in and out at inference time.
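The savings come from a low-rank factorization: instead of learning a full d×k update ΔW, LoRA learns B (d×k reduced to d×r) and A (r×k) with rank r much smaller than d or k, so the effective weight is W + B·A. A pure-Python sketch of the bookkeeping, with toy sizes and no real training:

```python
# Toy illustration of LoRA's parameter savings. The frozen base weight
# W is d x k; LoRA learns a low-rank update delta_W = B @ A with
# B: d x r and A: r x k, so only r * (d + k) parameters are trained.

d, k, r = 4096, 4096, 8          # typical hidden sizes, small rank

full_params = d * k              # updating W directly: 16777216
lora_params = r * (d + k)        # LoRA adapter parameters: 65536
print(round(100 * lora_params / full_params, 2))  # trains ~0.39% as many

def apply_lora(W, B, A, scale=1.0):
    """Effective weight at inference: W + scale * (B @ A), plain Python."""
    dW = [[scale * sum(B[i][t] * A[t][j] for t in range(len(A)))
           for j in range(len(A[0]))] for i in range(len(B))]
    return [[W[i][j] + dW[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

Because `apply_lora` can be folded into W once after training, a merged LoRA model adds no inference latency over the base model.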

QLoRA goes further: it loads the base model in 4-bit quantized form and trains LoRA adapters on top. This makes it realistic to fine-tune 7B–70B models on a single consumer or prosumer GPU.
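With the Hugging Face stack (`transformers` + `peft` + `bitsandbytes`), a QLoRA setup looks roughly like the config fragment below. The model name, rank, and target modules are placeholders for illustration, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # placeholder base model
    quantization_config=bnb_config,
)

# Train small LoRA adapters on top of the quantized weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a fraction of a percent trainable
```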

Preference optimization: RLHF & DPO

Once a model follows instructions, you often want to shape which answers it prefers. Both techniques start from the same data: pairs of responses (A vs. B) labeled with a human preference. RLHF fits a separate reward model to those preferences, then optimizes the policy against it with reinforcement learning (typically PPO). DPO (Direct Preference Optimization) skips the reward model and the RL loop entirely: a classification-style loss pushes the model toward the preferred response directly, which makes it far simpler to run in practice.
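The DPO objective is compact enough to show in full. Given the log-probabilities of the chosen and rejected responses under the trainable policy and under a frozen reference copy of the model, the loss is -log sigmoid(beta × (policy margin − reference margin)). A minimal pure-Python sketch for one preference pair:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability of a response sequence
    under the trainable policy or the frozen reference model.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)), written stably as log(1 + e^-x)
    return math.log1p(math.exp(-logits))

# If the policy already prefers the chosen response more strongly than
# the reference does, the loss is small; if not, it grows.
good = dpo_loss(-10.0, -20.0, -12.0, -14.0)  # policy margin 10 > ref margin 2
bad = dpo_loss(-20.0, -10.0, -12.0, -14.0)   # policy margin -10 < ref margin 2
assert good < bad
```

The `beta` term controls how far the policy is allowed to drift from the reference model; the reference term is what keeps preference tuning from collapsing into reward hacking.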

Continued pretraining

If you need the model to absorb a lot of new domain text (legal, medical, internal docs), you can continue the original next-token pretraining on that corpus before SFT. This is heavier than SFT and only worth it at real scale.

Choosing the right approach

| Goal | Start with | Why |
| --- | --- | --- |
| Answer questions about your docs | RAG | Keeps knowledge fresh and auditable without retraining. |
| Consistent output format / style | SFT (often LoRA) | Format adherence is exactly what SFT is good at. |
| Cheaper inference for a narrow task | SFT on a smaller model | Distill a large-model workflow into a smaller specialist. |
| "Make it prefer answers like this" | DPO | Directly optimizes preferences without a reward model. |
| Adopt a whole new domain language | Continued pretraining + SFT | Teach vocabulary and patterns before task-specific tuning. |

Data is the hard part

Whatever technique you pick, the dataset dominates the outcome: a few thousand clean, consistent examples usually beat ten times as many noisy ones, and most failed fine-tunes trace back to the data rather than the training setup.

A minimal fine-tuning workflow

  1. Define the task and write an eval set (20–200 examples).
  2. Measure a strong prompt-only baseline on that eval set.
  3. Collect training data (1k–10k examples for most tasks).
  4. Train a LoRA/QLoRA adapter on a reasonable base model.
  5. Evaluate, compare to baseline, iterate on data.
  6. Deploy with the adapter attached, and monitor in production.
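Steps 1, 2, and 5 above boil down to one comparison: score every candidate on the same frozen eval set. A toy harness sketch, with exact-match scoring and stand-in model functions (a real setup would swap in its own generation calls and a task-appropriate metric):

```python
# Toy eval harness: score any model function on a frozen eval set and
# compare it against the prompt-only baseline. The eval set and both
# "models" here are illustrative stand-ins.

eval_set = [
    {"input": "status code for 'not found'", "target": "404"},
    {"input": "status code for 'server error'", "target": "500"},
    {"input": "status code for 'ok'", "target": "200"},
]

def exact_match_score(model_fn, dataset):
    """Fraction of examples where the model output equals the target."""
    hits = sum(model_fn(ex["input"]) == ex["target"] for ex in dataset)
    return hits / len(dataset)

def baseline_model(prompt):   # stand-in for the prompt-only baseline
    return {"status code for 'ok'": "200"}.get(prompt, "I'm not sure.")

def finetuned_model(prompt):  # stand-in for the tuned adapter
    lookup = {"status code for 'not found'": "404",
              "status code for 'server error'": "500",
              "status code for 'ok'": "200"}
    return lookup.get(prompt, "")

baseline = exact_match_score(baseline_model, eval_set)
tuned = exact_match_score(finetuned_model, eval_set)
print(baseline, tuned)  # ship only if tuned clearly beats baseline
```

Keeping the eval set frozen across iterations is the point: it is the one fixed yardstick that tells you whether each round of data cleaning actually moved the needle.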

Rule of thumb: if you can't describe the target behavior in a prompt well enough for a human to grade outputs, you're not ready to fine-tune yet.