When to fine-tune, and how: SFT, LoRA, QLoRA, DPO, and RLHF without the hype.
Fine-tuning is powerful but expensive. In most cases, the right order of operations is: improve the prompt first, then add few-shot examples, then add retrieval (RAG), and only reach for fine-tuning once those stop moving the needle.
Good signals that fine-tuning will help: you have thousands of high-quality examples, your prompt is getting long and expensive, a smaller model would be cheaper if it could match quality, or you need a very specific output format the base model keeps missing.
Supervised fine-tuning (SFT) is the most common form. You provide input/output pairs (e.g., a user question and an ideal answer) and train the model to reproduce the outputs. It works best when your dataset is consistent, well-labeled, and reflects the exact behavior you want.
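The data-preparation step is where SFT pipelines most often go wrong. A minimal sketch (field names and the `-100` ignore index follow a common convention, but everything here is illustrative): loss is computed only on the completion tokens, so the model learns to produce the answer rather than echo the prompt.

```python
# One SFT training example: an input/output pair.
# (Hypothetical field names; adapt to your training framework.)
example = {
    "prompt": "Extract the total as JSON.\nInvoice: widgets, total due $41.20\n",
    "completion": '{"total": 41.20}',
}

IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_labels(prompt_ids, completion_ids):
    """Mask the prompt so the loss covers only the completion tokens."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)

# Toy token ids; a real pipeline would get these from a tokenizer.
labels = build_labels(prompt_ids=[11, 42, 7], completion_ids=[93, 15])
assert labels == [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 93, 15]
```

Whether to mask the prompt is a real design choice: masking focuses all gradient signal on the behavior you want, at the cost of ignoring any signal in the inputs.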
Instead of updating all of the model's weights, LoRA (Low-Rank Adaptation) freezes the base model and trains small low-rank "adapter" matrices inserted alongside selected weight matrices (typically the attention projections). Training is dramatically cheaper, and you can swap adapters in and out at inference time.
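The core mechanic fits in a few lines of NumPy (dimensions and names are illustrative, not any library's API): the frozen weight `W` gets a trainable low-rank update `B @ A`, and because `B` is initialized to zero, training starts from exactly the base model's behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 16, 16, 4          # r << d is the low-rank bottleneck
W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))            # trainable, zero init: update starts at 0
alpha = 8.0                         # scaling hyperparameter

def lora_forward(x):
    # Base path (frozen) plus the scaled low-rank update path (trainable).
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B at zero, the adapted layer matches the frozen base layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Only `A` and `B` get gradient updates: here that is r·(d_in + d_out) = 128 trainable values against 256 frozen ones, and the gap widens fast at real model sizes.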
QLoRA goes further: it loads the base model in 4-bit quantized form and trains LoRA adapters on top. This makes it realistic to fine-tune 7B–70B models on a single consumer or prosumer GPU.
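QLoRA's actual storage format is a 4-bit "NormalFloat" (NF4) type with double quantization; the simpler blockwise absmax int4 scheme below is a sketch that only conveys the mechanic of 4-bit storage, not the real implementation.

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    """Blockwise symmetric absmax quantization to levels -7..7 (fits in 4 bits)."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # one scale per block
    q = np.round(blocks / scale).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Recover an approximation of the original weights for the forward pass."""
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, w.shape)
# Rounding error is bounded by half a quantization step within each block.
assert np.abs(w - w_hat).max() <= scale.max() / 2 + 1e-6
```

During QLoRA training, the quantized base weights stay frozen; the forward pass dequantizes them on the fly, and gradients flow only into the LoRA adapters.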
Once a model follows instructions, you often want to shape which answers it prefers. You collect pairs of responses (A vs. B) labeled with a human preference, then train the model to favor the preferred one. DPO (Direct Preference Optimization) trains on these pairs directly; RLHF first fits a reward model to the preferences and then optimizes the policy against it with reinforcement learning.
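The DPO objective is compact enough to write out per preference pair (the log-probabilities below are toy values; `beta` is the usual temperature hyperparameter): the loss pushes the policy's chosen-vs-rejected log-ratio margin above that of a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), comparing policy and reference log-ratios."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Before any training the policy equals the reference: margin 0, loss log(2).
assert abs(dpo_loss(-5.0, -6.0, -5.0, -6.0) - math.log(2)) < 1e-9
# If the policy favors the chosen response more than the reference did, loss drops.
assert dpo_loss(-4.0, -7.0, -5.0, -6.0) < math.log(2)
```

Anchoring to the reference model is what keeps DPO from collapsing into degenerate outputs: the policy is rewarded only for shifting preference beyond the reference, not for maximizing any response's raw probability.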
Continued pretraining is the heavyweight option: if you need the model to absorb a lot of new domain text (legal, medical, internal docs), you can continue the original next-token objective on that corpus before SFT. This is far more compute-hungry than SFT and only worth it at real scale.
| Goal | Start with | Why |
|---|---|---|
| Answer questions about your docs | RAG | Keeps knowledge fresh and auditable without retraining. |
| Consistent output format / style | SFT (often LoRA) | Format adherence is exactly what SFT is good at. |
| Cheaper inference for a narrow task | SFT on a smaller model | Distill a large-model workflow into a smaller specialist. |
| "Make it prefer answers like this" | DPO | Directly optimizes preferences without a reward model. |
| Adopt a whole new domain language | Continued pretraining + SFT | Teach vocabulary and patterns before task-specific tuning. |
Rule of thumb: if you can't describe the target behavior in a prompt well enough for a human to grade outputs, you're not ready to fine-tune yet.