TL;DR
- Instruction tuning is the practice of fine-tuning a pretrained LM on a broad collection of NLP tasks reformatted as natural-language instructions, to elicit zero-shot generalisation.
- Originated with Google's FLAN (Wei et al., 2021), BigScience's T0 (Sanh et al., 2021), and OpenAI's InstructGPT (Ouyang et al., 2022) — the three papers that established it.
- The successor to plain SFT in name only: 'instruction tuning' historically refers to the dataset philosophy (many tasks, instruction-formatted), while SFT refers to the training mechanism.
- Modern instruction-tuned models (Llama Instruct, Mistral Instruct, Qwen Instruct) are produced by SFT on instruction-style datasets followed by preference optimisation.
Origins#
Until 2021, the dominant paradigm for adapting a pretrained language model was task-specific fine-tuning — train one BERT for classification, another for QA, another for NLI. FLAN proposed something different: take a single base model, fine-tune it on a collection of tasks reformatted as natural-language instructions, and evaluate zero-shot on held-out tasks.
The empirical result was striking. A model instruction-tuned on a wide variety of tasks generalised better to unseen tasks than the base model with few-shot prompting. The dataset structure — many tasks, expressed naturally — turned out to be a learning signal in its own right. T0 reproduced the result independently with a different task collection; InstructGPT extended it with human-written instructions and preference data.
Dataset Composition#
An instruction-tuning dataset is characterised by three properties: task diversity, instruction-style formatting, and (in modern practice) chat-style conversational structure.
| Dataset | Year | Scale | Notes |
|---|---|---|---|
| FLAN | 2021 | 62 tasks, 10× templates | First demonstration |
| T0 | 2021 | 171 tasks | BigScience open release |
| Super-NaturalInstructions | 2022 | 1,616 tasks | Crowdsourced from researchers |
| FLAN 2022 | 2022 | 1,836 tasks | Used in PaLM 2 |
| Self-Instruct / Alpaca | 2022-23 | 52K → 1M+ examples | Synthetic instructions |
| OpenAssistant / OASST | 2023 | ~160K conversations | Crowdsourced multi-turn |
| UltraChat / UltraFeedback | 2023 | 1.5M conversations | Synthetic high quality |
| Tulu 3 mix | 2024 | ~1M curated | Modern open recipe |
Instruction Tuning vs SFT vs Alignment#
These three terms are often used loosely. The distinction that matters: 'SFT' names a training mechanism (next-token cross entropy with prompt masking); 'instruction tuning' names a dataset philosophy (broad task coverage in instruction format); 'alignment' names a goal (matching human preferences and safety constraints). A typical modern recipe is 'SFT on an instruction-tuning dataset followed by DPO for alignment.'
Outside the academic literature these terms blur. In practice most teams say 'we instruction-tuned the model' when they mean 'we ran SFT on instruction-format data'. The phrasing is harmless as long as the dataset and method are documented.
What Instruction Tuning Buys You#
- Zero-shot generalisation to unseen task types — the core empirical claim from FLAN.
- Robustness to prompt phrasing — instruction-tuned models follow paraphrased instructions reliably where base models do not.
- A foundation for further alignment — preference-optimisation methods like DPO assume an instruction-following starting point.
- Multi-turn capability when datasets include conversations — modern instruction tuning is almost always chat-shaped.
What It Does Not Buy You#
- Factual reliability — instruction tuning does not reduce hallucination; if anything it can increase confident wrong answers.
- Tool use — function calling and tool routing need dedicated training data and often a separate post-training stage.
- Long-context behaviour — instruction tuning on short examples can degrade long-context performance unless explicitly mixed in.
- Domain expertise — broad instruction tuning improves general capability but cannot substitute for domain-specific data.
When to Run Your Own Instruction Tune#
If you are starting from an off-the-shelf base model, an instruction tune is the first thing you do — without it the model is not usable as an assistant. If you are starting from a published instruct or chat model, run domain-specific SFT on top rather than redoing the broad instruction tune from scratch. Re-instruction-tuning a chat model on a narrow dataset is a common mistake that degrades its general capability.
References
- Finetuned Language Models Are Zero-Shot Learners (FLAN) · arXiv (Wei et al., 2021)
- Multitask Prompted Training Enables Zero-Shot Task Generalization (T0) · arXiv (Sanh et al., 2021)
- Training language models to follow instructions (InstructGPT) · arXiv (Ouyang et al., 2022)