Instruction Tuning

TL;DR

Instruction tuning is the practice of fine-tuning a pretrained LM on a broad collection of NLP tasks reformatted as natural-language instructions, to elicit zero-shot generalisation.
Originated with Google's FLAN (Wei et al., 2021), BigScience's T0 (Sanh et al., 2021), and OpenAI's InstructGPT (Ouyang et al., 2022) — the three papers that established it.
The successor to plain SFT in name only: 'instruction tuning' historically refers to the dataset philosophy (many tasks, instruction-formatted), while SFT refers to the training mechanism.
Modern instruction-tuned models (Llama Instruct, Mistral Instruct, Qwen Instruct) are produced by SFT on instruction-style datasets followed by preference optimisation.

Origins#

Until 2021, the dominant paradigm for adapting a pretrained language model was task-specific fine-tuning — train one BERT for classification, another for QA, another for NLI. FLAN proposed something different: take a single base model, fine-tune it on a collection of tasks reformatted as natural-language instructions, and evaluate zero-shot on held-out tasks.

The empirical result was striking. A model instruction-tuned on a wide variety of tasks generalised better to unseen tasks than the base model with few-shot prompting. The dataset structure — many tasks, expressed naturally — turned out to be a learning signal in its own right. T0 reproduced the result independently with a different task collection; InstructGPT extended it with human-written instructions and preference data.

Dataset Composition#

An instruction-tuning dataset is characterised by three properties: task diversity, instruction-style formatting, and (in modern practice) chat-style conversational structure.

Dataset	Year	Scale	Notes
FLAN	2021	62 tasks, 10× templates	First demonstration
T0	2021	171 tasks	BigScience open release
Super-NaturalInstructions	2022	1,616 tasks	Crowdsourced from researchers
FLAN 2022	2022	1,836 tasks	Used in PaLM 2
Self-Instruct / Alpaca	2022-23	52K → 1M+ examples	Synthetic instructions
OpenAssistant / OASST	2023	~160K conversations	Crowdsourced multi-turn
UltraChat / UltraFeedback	2023	1.5M conversations	Synthetic high quality
Tulu 3 mix	2024	~1M curated	Modern open recipe

Instruction Tuning vs SFT vs Alignment#

These three terms are often used loosely. The distinction that matters: 'SFT' names a training mechanism (next-token cross entropy with prompt masking); 'instruction tuning' names a dataset philosophy (broad task coverage in instruction format); 'alignment' names a goal (matching human preferences and safety constraints). A typical modern recipe is 'SFT on an instruction-tuning dataset followed by DPO for alignment.'

Outside the academic literature these terms blur. In practice most teams say 'we instruction-tuned the model' when they mean 'we ran SFT on instruction-format data'. The phrasing is harmless as long as the dataset and method are documented.

What Instruction Tuning Buys You#

Zero-shot generalisation to unseen task types — the core empirical claim from FLAN.
Robustness to prompt phrasing — instruction-tuned models follow paraphrased instructions reliably where base models do not.
A foundation for further alignment — preference-optimisation methods like DPO assume an instruction-following starting point.
Multi-turn capability when datasets include conversations — modern instruction tuning is almost always chat-shaped.

What It Does Not Buy You#

Factual reliability — instruction tuning does not reduce hallucination; if anything it can increase confident wrong answers.
Tool use — function calling and tool routing need dedicated training data and often a separate post-training stage.
Long-context behaviour — instruction tuning on short examples can degrade long-context performance unless explicitly mixed in.
Domain expertise — broad instruction tuning improves general capability but cannot substitute for domain-specific data.

When to Run Your Own Instruction Tune#

If you are starting from an off-the-shelf base model, an instruction tune is the first thing you do — without it the model is not usable as an assistant. If you are starting from a published instruct or chat model, run domain-specific SFT on top rather than redoing the broad instruction tune from scratch. Re-instruction-tuning a chat model on a narrow dataset is a common mistake that degrades its general capability.

References

Finetuned Language Models Are Zero-Shot Learners (FLAN) · arXiv (Wei et al., 2021)
Multitask Prompted Training Enables Zero-Shot Task Generalization (T0) · arXiv (Sanh et al., 2021)
Training language models to follow instructions (InstructGPT) · arXiv (Ouyang et al., 2022)

Origins#

Dataset Composition#

An instruction-tuning dataset is characterised by three properties: task diversity, instruction-style formatting, and (in modern practice) chat-style conversational structure.

Dataset	Year	Scale	Notes
FLAN	2021	62 tasks, 10× templates	First demonstration
T0	2021	171 tasks	BigScience open release
Super-NaturalInstructions	2022	1,616 tasks	Crowdsourced from researchers
FLAN 2022	2022	1,836 tasks	Used in PaLM 2
Self-Instruct / Alpaca	2022-23	52K → 1M+ examples	Synthetic instructions
OpenAssistant / OASST	2023	~160K conversations	Crowdsourced multi-turn
UltraChat / UltraFeedback	2023	1.5M conversations	Synthetic high quality
Tulu 3 mix	2024	~1M curated	Modern open recipe

Instruction Tuning vs SFT vs Alignment#

What Instruction Tuning Buys You#

Zero-shot generalisation to unseen task types — the core empirical claim from FLAN.

Robustness to prompt phrasing — instruction-tuned models follow paraphrased instructions reliably where base models do not.

A foundation for further alignment — preference-optimisation methods like DPO assume an instruction-following starting point.

Multi-turn capability when datasets include conversations — modern instruction tuning is almost always chat-shaped.

What It Does Not Buy You#

Factual reliability — instruction tuning does not reduce hallucination; if anything it can increase confident wrong answers.

Tool use — function calling and tool routing need dedicated training data and often a separate post-training stage.

Long-context behaviour — instruction tuning on short examples can degrade long-context performance unless explicitly mixed in.

Domain expertise — broad instruction tuning improves general capability but cannot substitute for domain-specific data.

When to Run Your Own Instruction Tune#

Instruction Tuning

Origins#

Dataset Composition#

Instruction Tuning vs SFT vs Alignment#

What Instruction Tuning Buys You#

What It Does Not Buy You#

When to Run Your Own Instruction Tune#

References

Browse all entries

Deploy on Yobitel

Instruction Tuning

Origins#

Dataset Composition#

Instruction Tuning vs SFT vs Alignment#

What Instruction Tuning Buys You#

What It Does Not Buy You#

When to Run Your Own Instruction Tune#

References

Browse all entries

Deploy on Yobitel