Supervised Fine-Tuning (SFT)

TL;DR

Supervised fine-tuning (SFT) is the standard cross-entropy next-token-prediction training stage applied to curated (prompt, response) pairs, with the loss masked to zero on prompt tokens so gradient flows only through the response — turning a base language model into something that follows instructions and respects a chat template.
It is the first and mandatory stage of every modern instruction-following or chat model: Llama 3.1 Instruct, Mistral Large Instruct, Qwen3-Instruct, Gemma 2 Instruct, GPT-4o, Claude 3.5 and every public open-weights chat model in 2026 went through an SFT stage before any preference optimisation (DPO, ORPO, KTO, RLHF) was layered on top.
Three structural variants dominate in 2026: full-parameter SFT (every parameter trainable, highest quality, requires multi-GPU for anything above 13B), LoRA SFT (frozen base + low-rank adapter, the default for 7-70B in single-GPU and single-node settings), and packed-sequence SFT (concatenates short samples up to context length using variable-length attention masks, 2-4x throughput improvement on chat data).
Quality is dominated by data. The LIMA result (Zhou et al., NeurIPS 2023) showed 1,000 carefully chosen examples produce a competitive instruction-following model; Llama 3.1 used roughly 10 million SFT examples but with extensive deduplication and rejection sampling. Diversity, response quality and deduplication beat raw volume by a large margin.
Yobibyte FineTune exposes SFT as the default method on its customer-facing API; preference methods (DPO, ORPO, KTO, GRPO) on the same surface stack on top of the SFT artefact rather than replacing it — the standard 2026 post-training pipeline is SFT first, preference optimisation second, evaluation continuous.

Overview

A base language model trained only on raw text learns to continue documents: given any prefix it will produce something that plausibly follows in the style of its training distribution. That is not the same as following instructions. A user typing 'Write a Python function that reverses a string' to a base model is as likely to get back a list of similar coding questions, a Stack Overflow-style discussion, or a continuation of what looks like a tutorial preamble as actual working code. The base model is not refusing to help; it has no concept of refusing or helping. It is doing what it was trained to do: complete the document. Supervised fine-tuning is the training stage that teaches the model that the text is not an open-ended document to continue: it is a conversation with a specific structure, and the model's role is to produce the response part of that structure.

Mechanically SFT is the simplest possible adaptation: the same cross-entropy next-token loss the model was pretrained with, the same optimiser, the same code path — but applied to curated (prompt, response) pairs rendered into a chat template, with the loss masked to zero on the prompt tokens so the gradient flows only through the response. The model still sees the prompt during the forward pass (the attention mechanism conditions the response on it) but is never asked to predict the prompt tokens. After enough optimiser steps on enough examples, the model learns the implicit rule: when I see the structure '<|user|> ... <|assistant|>', the right thing to produce is whatever shape the training responses had. That implicit rule generalises far beyond the training prompts; this generalisation is what makes SFT the foundation of every instruction-following model.

SFT is also the prerequisite for every preference method. DPO, ORPO, KTO, IPO, CPO, SimPO, GRPO and PPO-style RLHF all assume the model already produces sensible candidate responses; their job is to teach the model to prefer the better candidates among several. A pretrained base model produces noise as candidates and preference methods cannot meaningfully optimise noise. The canonical 2026 post-training pipeline is therefore: (1) SFT to teach format and basic task coverage; (2) preference optimisation to refine response quality; (3) optional safety fine-tuning; (4) continuous evaluation. SFT is never skipped, never replaced.

This entry helps you decide when to run SFT, which variant fits your data and compute, how to set the hyperparameters that actually matter, and how the result interacts with downstream preference methods. The relationship to Yobitel: Yobibyte FineTune's customer-facing API exposes SFT as the default method (`method: sft`), with all three variants (full FT, LoRA, packed-sequence) selectable behind the same job submission. Yobitel NeoCloud rents the H100 and H200 capacity the SFT job runs on. InferenceBench evaluates SFT-trained adapters alongside base models on its public leaderboard so customers can confirm quality lift before rollout.

How it works: the loss, the mask and the optimiser

The mathematics of SFT is identical to pretraining. The model M predicts a next-token distribution P(t_i | t_<i; M) over the vocabulary at every position. The training loss is the standard negative log-likelihood: L = -sum_i (mask_i * log P(t_i | t_<i; M)) / sum_i mask_i, where mask_i is 1 on response tokens and 0 on prompt tokens. The denominator (sum of mask) normalises by the number of response tokens rather than by the number of sequences, which is the convention in TRL's SFTTrainer and most modern frameworks; it keeps the loss scale comparable across batches that contain different amounts of response text.
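
The formula can be checked by hand with a minimal sketch. It assumes the per-token log-probabilities are already available as plain floats; the function name `masked_nll` and the toy numbers are illustrative and not taken from any library.

```python
import math

def masked_nll(token_logprobs, mask):
    """Mean negative log-likelihood over positions where mask == 1."""
    if len(token_logprobs) != len(mask):
        raise ValueError("token_logprobs and mask must have the same length")
    n_response = sum(mask)
    if n_response == 0:
        raise ValueError("mask selects no response tokens")
    total = -sum(m * lp for lp, m in zip(token_logprobs, mask))
    return total / n_response

# Toy example: 3 prompt tokens (mask 0) followed by 2 response tokens (mask 1).
logprobs = [math.log(0.1), math.log(0.2), math.log(0.3), math.log(0.5), math.log(0.25)]
mask = [0, 0, 0, 1, 1]
loss = masked_nll(logprobs, mask)  # equals -(log 0.5 + log 0.25) / 2
```

The prompt positions contribute nothing to the numerator or the denominator, so however badly the model predicts the prompt, the loss is unchanged.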

The dataset pipeline. Each training example is a (prompt, response) pair (or a multi-turn (system, [user, assistant]+) conversation). The pair is rendered into the base model's chat template; for Llama 3.1 that means wrapping in `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{response}<|eot_id|>`. The tokeniser produces input_ids; a parallel mask records which positions correspond to response tokens. When sample packing is enabled, the example is concatenated with others up to the configured sequence length; otherwise it is padded to the batch length. Either way it is shipped to the GPU as part of a (batch, seq_len) tensor.
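
A rough sketch of the render-then-mask step, using the Llama 3.1 layout quoted above. The character-level `toy_tokenise` is a hypothetical stand-in for a real tokeniser, and tokenising the two halves separately is a simplification: a real BPE tokeniser should be checked for merges across the prompt/response boundary.

```python
# Llama 3.1 single-turn layout as quoted in the text.
PROMPT_PART = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
RESPONSE_PART = "{response}<|eot_id|>"

def toy_tokenise(text):
    """Hypothetical stand-in for a real tokeniser: one 'token' per character."""
    return list(text)

def render_with_mask(prompt, response):
    """Return (tokens, mask) where mask is 1 only on response positions."""
    prompt_tokens = toy_tokenise(PROMPT_PART.format(prompt=prompt))
    response_tokens = toy_tokenise(RESPONSE_PART.format(response=response))
    tokens = prompt_tokens + response_tokens
    mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, mask

tokens, mask = render_with_mask("Hi", "Hello!")
```

Note that the closing `<|eot_id|>` sits inside the response span in this sketch, so the model is also trained to emit the end-of-turn marker and stop.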

The forward pass. The base model sees the full input_ids — prompt and response both — and computes logits at every position. Standard causal attention means each position attends only to itself and prior positions, so the response tokens are conditioned on the prompt but the prompt tokens are not affected by the response. The cross-entropy loss is computed at every position, then the prompt-mask multiplies away the contributions from prompt tokens. The gradient backpropagates through the entire model (full FT) or through the LoRA adapter and frozen base path (LoRA SFT, see the LoRA entry).

The optimiser. AdamW with cosine decay is the default. Full-parameter SFT uses learning rates 1e-5 to 5e-5 (small, because the gradient signal is strong on every parameter and large LR causes catastrophic forgetting). LoRA SFT uses 1e-4 to 5e-4 (larger, because only the small adapter is learning and the base provides a stabilising anchor). Warmup ratio 3-10% of total steps prevents early-step instabilities. Effective batch size 32-256 (via gradient accumulation) is standard; smaller batches are noisy, larger batches are wasteful. Weight decay 0 is the safe default for LoRA, 0.01 for full FT.
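
The warmup-plus-cosine shape described here can be sketched in a few lines. The function name `lr_at_step` and the default values are illustrative assumptions (a LoRA-range peak of 2e-4 and 3% warmup); production trainers ship their own schedulers.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 1000
schedule = [lr_at_step(s, total) for s in range(total)]
```

Plotting `schedule` shows the short ramp followed by a smooth decay; passing a non-zero `min_lr` keeps a floor under the learning rate late in the run.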

Epochs. The single most consequential hyperparameter and the one most often mis-set. One epoch is the right answer for nearly every dataset above 50k examples. Two to three epochs is acceptable for 10-50k examples but the overfitting risk rises. Above three epochs, training loss continues to fall but evaluation quality usually degrades — the model memorises the training responses and loses generalisation. The LIMA paper's 1k-example fine-tune ran for 15 epochs; small dataset regimes are different, but for any dataset that looks like 'production scale', one to three epochs is the recipe.

Loss: standard cross-entropy next-token-prediction. Same as pretraining.
Mask: 1 on response tokens, 0 on prompt tokens. Gradient flows only through response.
Forward pass: full input (prompt + response) goes through the model; causal attention conditions response on prompt.
Optimiser: AdamW; cosine LR decay; warmup 3-10%; effective batch 32-256.
Learning rate: 1e-5 to 5e-5 for full FT; 1e-4 to 5e-4 for LoRA SFT.
Epochs: 1 for >50k examples; 1-3 for 10-50k; up to 15 for sub-1k 'LIMA-style' runs.

```python
# sft_minimal.py: one illustrative SFT step with prompt masking.
# Real runs use TRL's SFTTrainer; this shows what it does under the hood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B"
# If the base checkpoint's tokeniser ships without a chat template,
# load the tokeniser from the matching Instruct checkpoint instead.
tok = AutoTokenizer.from_pretrained(model_name)
mdl = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

def build_example(prompt: str, response: str):
    """Render via chat template, then mark prompt tokens with -100 (ignored by CE loss)."""
    messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
    # return_dict=True keeps the return shape stable across transformers versions.
    full = tok.apply_chat_template(
        messages, tokenize=True, return_tensors="pt", return_dict=True,
    )["input_ids"]
    prompt_only = tok.apply_chat_template(
        messages[:1], tokenize=True, return_tensors="pt", return_dict=True,
        add_generation_prompt=True,
    )["input_ids"]
    prompt_len = prompt_only.shape[1]
    # Assumes the rendered prompt is an exact token prefix of the full render
    # (true for Llama 3-style templates); fail loudly if it is not.
    assert torch.equal(full[:, :prompt_len], prompt_only), "prompt is not a prefix of the full render"
    labels = full.clone()
    labels[:, :prompt_len] = -100        # mask prompt tokens
    return full.to(mdl.device), labels.to(mdl.device)

prompt   = "Write a Python function that reverses a string."
response = "def reverse(s: str) -> str:\n    return s[::-1]"

input_ids, labels = build_example(prompt, response)
out  = mdl(input_ids=input_ids, labels=labels)   # HF shifts labels and applies CE, skipping -100
loss = out.loss                                  # scalar mean over unmasked response tokens
# loss.backward() ... optimiser.step() ... over thousands of examples = SFT.
```

Variants and architectural choices

Three structural variants of SFT dominate the open-source ecosystem in 2026. The variants are orthogonal to the underlying loss — every variant uses the same masked cross-entropy — but they differ in how parameters are updated, how data is packed into batches and how the optimiser state is shaped. Choose the variant by your compute envelope and the size of the base model.

Full-parameter SFT is the highest-quality variant and the canonical recipe behind the public open-weights instruction releases (Llama 3.1 Instruct, Mistral Large Instruct, Qwen3-Instruct, etc.). Every parameter receives a gradient. With mixed-precision AdamW, the weights, gradients, FP32 master copy and two optimiser moments come to roughly 16 bytes per parameter, about 8x the BF16 weight memory before activations are counted, which is why anything above 13B requires multi-GPU DeepSpeed ZeRO-3 or FSDP. Public open-weights teams run full-parameter SFT on 64-512 H100 clusters; Yobibyte FineTune supports it for customer teams that need maximum quality but defaults to LoRA SFT for the cost-economics reasons covered in the LoRA entry.
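
A back-of-envelope estimate of the training state makes the multi-GPU requirement concrete. The byte counts below assume a common mixed-precision AdamW layout (BF16 weights and gradients, FP32 master weights and two FP32 moments); real frameworks differ, for example with FP32 gradients or 8-bit optimisers, and activations are excluded. The function name is illustrative.

```python
def full_ft_state_bytes(n_params):
    """Weights + gradients + AdamW state for mixed-precision training (no activations)."""
    bytes_per_param = {
        "bf16_weights": 2,
        "bf16_gradients": 2,
        "fp32_master_weights": 4,
        "fp32_adam_first_moment": 4,
        "fp32_adam_second_moment": 4,
    }
    return n_params * sum(bytes_per_param.values())

GIB = 1024 ** 3
for billions in (8, 13, 70):
    n = billions * 10**9
    total = full_ft_state_bytes(n)
    ratio = total / (2 * n)  # relative to the BF16 weights alone
    print(f"{billions}B params: ~{total / GIB:,.0f} GiB of state ({ratio:.0f}x BF16 weights)")
```

Under these assumptions even a 13B model needs well over the 80 GB of a single H100 for state alone, which is what sharding schemes such as ZeRO-3 and FSDP are designed to spread across devices.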

LoRA SFT is the practical default. The base is frozen, low-rank A and B matrices are inserted into every linear layer in attention and MLP, and only those receive gradients. The adapter file is 10-500 MB. Optimiser state shrinks in proportion to the trainable fraction, typically by two to three orders of magnitude. Quality lands within 1-2 points of full FT on standard instruction-tuning evaluations; the gap closes further with DoRA (`use_dora=True` in PEFT). LoRA SFT is what Yobibyte FineTune runs by default for single-GPU and single-node fine-tunes.
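
To see where the adapter size comes from, the sketch below counts LoRA weights for one configuration. The layer shapes are assumptions loosely modelled on an 8B Llama-style decoder, not values read from a checkpoint, and `adapter_size` is an illustrative helper.

```python
def lora_params(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r): r * (d_in + d_out) trainable weights."""
    return r * (d_in + d_out)

# Hypothetical layer shapes, loosely modelled on an 8B Llama-style decoder.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def adapter_size(r):
    """Total LoRA weights when every attention and MLP linear is targeted."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in LINEAR_SHAPES.values())
    return per_layer * LAYERS

trainable = adapter_size(r=32)
fraction = trainable / 8.0e9      # share of a nominal 8B base
adapter_mb = trainable * 2 / 1e6  # BF16 adapter file, 2 bytes per weight
```

With these shapes, r=32 gives about 84 million trainable weights, roughly 1% of the base and about 168 MB in BF16. Targeting fewer modules or lowering the rank shrinks all three numbers proportionally.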

Packed-sequence SFT is an orthogonal throughput optimisation. Without packing, every (prompt, response) pair becomes a padded sequence — most positions are padding for short examples, wasting compute. Packing concatenates multiple short examples up to the configured `sequence_len`, separated by document boundaries that the attention mask respects (via FlashAttention's variable-length attention API). Throughput rises 2-4x on chat datasets where the typical example is ~500 tokens but sequence_len is 4k or 8k. Implementations must handle the cross-document attention mask correctly; a buggy packing implementation produces silent quality degradation because tokens attend across document boundaries.
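
The packing step can be sketched as greedy first-fit bin packing plus a list of document boundaries per packed row. This is one simple strategy under stated assumptions, not the algorithm of any particular framework; the function names are illustrative. The cumulative boundaries play the role of the cumulative sequence lengths that variable-length attention kernels (for example the `cu_seqlens` arguments of FlashAttention's `flash_attn_varlen_func`) use to stop tokens attending across documents.

```python
def pack_examples(lengths, sequence_len):
    """Greedy first-fit packing: returns a list of bins, each a list of example indices."""
    bins, free = [], []
    for idx, n in enumerate(lengths):
        if n > sequence_len:
            raise ValueError(f"example {idx} ({n} tokens) exceeds sequence_len")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            # No existing bin has room: open a new packed row.
            bins.append([idx])
            free.append(sequence_len - n)
    return bins

def cumulative_boundaries(bin_indices, lengths):
    """Document boundaries inside one packed row, starting at 0."""
    bounds = [0]
    for idx in bin_indices:
        bounds.append(bounds[-1] + lengths[idx])
    return bounds

lengths = [500, 700, 3000, 400, 1200, 300]
bins = pack_examples(lengths, sequence_len=4096)
boundaries = [cumulative_boundaries(b, lengths) for b in bins]
```

Here six examples fit into two rows of 4,096 tokens instead of six padded rows; the boundaries for the first row are `[0, 500, 1200, 1600, 2800, 3100]`.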

Multi-turn SFT is the variant required for assistant models that hold conversations across many turns. Training data is rendered as conversations: (system, user_1, assistant_1, user_2, assistant_2, ...), and the loss mask is set to 1 on every assistant turn and 0 on every system and user turn. This trains the model to produce coherent assistant responses conditioned on the entire prior conversation, not just the last user turn. The cost is longer sequence lengths (multi-turn conversations naturally run 2k-8k tokens) and slightly more careful chat-template construction.
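
A minimal sketch of the per-turn mask, assuming whitespace splitting as a stand-in tokeniser and omitting the chat-template tokens that a real pipeline would interleave between turns. The function name `multi_turn_mask` is illustrative.

```python
def multi_turn_mask(turns, tokenise=str.split):
    """turns: list of (role, text). Returns (tokens, mask) with mask=1 only on assistant turns."""
    tokens, mask = [], []
    for role, text in turns:
        turn_tokens = tokenise(text)
        tokens.extend(turn_tokens)
        mask.extend([1 if role == "assistant" else 0] * len(turn_tokens))
    return tokens, mask

conversation = [
    ("system", "You are concise."),
    ("user", "Name a prime."),
    ("assistant", "Seven."),
    ("user", "Another one?"),
    ("assistant", "Eleven is prime."),
]
tokens, mask = multi_turn_mask(conversation)
```

Both assistant turns carry loss in the same sequence, so one forward pass supervises every reply in the conversation while each reply still sees all earlier turns through causal attention.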

Continued-pretraining-style SFT is used when the target domain has a very different distribution from pretraining (legal, medical, code, specialised languages). Mechanically it is the same loss, but the dataset is raw text or weakly-supervised pairs at very large scale (millions of examples), the learning rate is lower (5e-6 to 1e-5), and the run is mixed with 10-20% original pretraining distribution data to prevent catastrophic forgetting. Standard 'instruction-tuning' SFT is a small subset of this regime.

Variant	Trainable params	Optimiser state	Hardware footprint	Quality vs full FT
Full-parameter SFT	100% of base	Largest — AdamW state per param	Multi-GPU mandatory above 13B (DeepSpeed ZeRO-3 / FSDP)	Baseline — the reference
LoRA SFT (r=16-64)	~0.1-1% of base	~100-1,000x smaller	Single-GPU fits 7-13B; H100/H200 fits 70B with QLoRA	95-99% of full FT on instruction tuning; gap closes with DoRA
QLoRA SFT (NF4 base)	Same as LoRA SFT	Same as LoRA SFT	Single-GPU fits up to 70B on H100 80 GB	<0.5 pt below BF16 LoRA SFT on most workloads
Packed-sequence SFT	Same as base method	Same as base method	Same as base method; throughput 2-4x	Identical quality to padded SFT when packing implemented correctly
Multi-turn SFT	Same as base method	Same as base method	Same as base method; sequence_len typically larger	Required for multi-turn chat quality; not optional for assistant models
Continued-pretraining-style SFT	Full FT typically	Largest	Multi-GPU	Used for large domain shifts; standard instruction-tune SFT is a subset

Where it is used today: the post-training pipeline in 2026

Every modern instruction-following LLM goes through an SFT stage as the first step of post-training. The public open-weights releases document this explicitly: Llama 3.1 Instruct used roughly 10 million SFT examples sourced from human annotation, synthetic generation (Llama 3.1 405B used as a teacher) and rejection sampling. Mistral Large Instruct used a similar pipeline at lower volume but higher curation density. Qwen3-Instruct combined SFT on a multi-task instruction corpus with a substantial mathematics and code emphasis. Gemma 2 Instruct and the Llama 3.2 small-model line followed the same shape.

On the frontier closed-model side, OpenAI, Anthropic and Google all run SFT as the first post-training stage, though the exact data composition is not public. The public 2022 InstructGPT paper (Ouyang et al., arXiv:2203.02155) documented the original SFT-then-RLHF recipe that still dominates frontier post-training; subsequent generations have refined the data sources (more synthetic generation, more rejection sampling, more multi-turn) but the architectural shape — SFT first, preference optimisation second — has not changed.

The customer-facing fine-tune APIs all expose SFT as their headline method. OpenAI's fine-tune API is SFT (LoRA underneath). Anthropic's Claude 3.5 Haiku fine-tune is SFT. Mistral's fine-tune API is SFT. Together AI, Replicate, Fireworks, AWS Bedrock and Vertex AI custom-model offerings are all SFT-first. Yobibyte FineTune exposes SFT under `method: sft`, which is the default if no method is specified.

Open-source frameworks. TRL's SFTTrainer is the canonical implementation and the substrate everything else builds on; Axolotl (`adapter: lora` or `adapter: qlora` with no `rl:` field) defaults to SFT; Unsloth's FastLanguageModel paired with TRL's SFTTrainer is the throughput-optimised single-GPU recipe; LLaMA-Factory exposes SFT through its Gradio UI. NVIDIA NeMo, DeepSpeed Examples and Hugging Face Accelerate all provide reference SFT pipelines.

Sizing — what an SFT run actually costs in 2026, in USD at Yobitel NeoCloud reference pricing ($2.60/H100/hr, $3.20/H200/hr).

Workload	Variant	Hardware	Wall time	Cost (NeoCloud)
8B SFT, 10k examples, 1 epoch	LoRA r=32	1x H100	~1 hr	~$2.60
8B SFT, 100k examples, 2 epochs	LoRA r=32	1x H100	~15 hr	~$39
13B SFT, 50k examples, 2 epochs	QLoRA r=32	1x H100	~12 hr	~$31
70B SFT, 50k examples, 1 epoch	QLoRA r=32	1x H200	~30 hr	~$96
70B SFT, 100k examples, 2 epochs	Full FT (ZeRO-3)	8x H100	~16 hr	~$333
70B SFT, 10M examples, 1 epoch	Full FT (ZeRO-3)	64x H100	~5 days	~$20,000
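
The cost column is simply GPU count times wall-clock hours times the hourly rate. A short sketch reproducing a few rows from the reference pricing quoted above (the helper name `run_cost` is illustrative):

```python
RATES_USD_PER_GPU_HOUR = {"H100": 2.60, "H200": 3.20}  # reference pricing quoted above

def run_cost(gpu_type, n_gpus, wall_hours):
    """Cost in USD for n_gpus of gpu_type held for wall_hours."""
    return RATES_USD_PER_GPU_HOUR[gpu_type] * n_gpus * wall_hours

rows = [
    ("8B LoRA, 10k examples", "H100", 1, 1),
    ("70B QLoRA, 50k examples", "H200", 1, 30),
    ("70B full FT, 100k examples", "H100", 8, 16),
    ("70B full FT, 10M examples", "H100", 64, 5 * 24),
]
for label, gpu, n, hours in rows:
    print(f"{label}: ${run_cost(gpu, n, hours):,.2f}")
```

The same function gives a quick estimate for any other combination of hardware and wall time, provided the wall-time guess is realistic for the dataset and sequence length.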

Trade-offs and known limitations

SFT is the cheapest, simplest, most reliable form of post-training, and it is also the limit of what naive next-token prediction can achieve. The trade-offs below are worth understanding before committing to SFT-only versus SFT followed by preference optimisation.

Quality ceiling. SFT teaches the model to mimic the response distribution in the training data. It does not teach the model to prefer better responses among several plausible candidates, to refuse cleanly when appropriate, to optimise for tone or factual conservatism beyond what the training distribution exhibits, or to recover gracefully from its own errors. Anything that requires reasoning about response quality rather than reproducing response shape is outside SFT's scope. Layered preference optimisation (DPO, ORPO, KTO, GRPO) addresses these gaps.

Data sensitivity. SFT quality is dominated by data — much more than by hyperparameter choice. A small, clean, diverse, deduplicated dataset reliably outperforms a large, noisy one of the same domain. LIMA (Zhou et al., NeurIPS 2023) showed 1,000 examples could produce a competitive instruction-following model; subsequent work has reproduced the finding. The corollary: most of the engineering effort in producing a high-quality SFT-trained model goes into the data pipeline (deduplication, response quality filtering, diversity sampling, length balancing), not the trainer code.

Catastrophic forgetting. Full-parameter SFT on a narrow domain (e.g. only medical conversations) reliably erases capabilities in adjacent domains (general chat, code, math) because every parameter is updated and the gradient pushes the model toward the narrow distribution. The standard mitigations are: mix 10-30% general-purpose instruction data into the narrow SFT corpus; use LoRA SFT instead of full FT (the frozen base preserves general capability); train fewer epochs; lower learning rate.
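
The data-mixing mitigation can be sketched as follows. The function name `mix_corpora`, the 20% default and the toy corpora are illustrative assumptions; the point is only that the general-purpose share is computed relative to the final mixed corpus.

```python
import random

def mix_corpora(narrow, general, general_fraction=0.2, seed=0):
    """Return a shuffled corpus where about general_fraction of examples come from `general`."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (len(narrow) + n_general) = general_fraction for n_general.
    n_general = round(len(narrow) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general))
    rng = random.Random(seed)
    mixed = list(narrow) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed

narrow = [f"medical-{i}" for i in range(800)]
general = [f"general-{i}" for i in range(5000)]
mixed = mix_corpora(narrow, general, general_fraction=0.2)
```

With 800 narrow examples and a 20% target, the mix draws 200 general examples for a corpus of 1,000. Fixing the seed keeps the sample reproducible between runs.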

Refusal collapse. Overweighting cautious or safety-oriented examples teaches the model to refuse legitimate requests because they superficially resemble refused training cases. The mitigation is dataset-level: balance helpful and refused examples, audit refusal triggers, evaluate on legitimate-but-borderline prompts before shipping.

Chat template fragility. SFT teaches the model to respond to a specific chat-template structure. If serving uses a different template — even subtly different (extra newline, different separator token, different system-prompt wrapping) — quality degrades silently. The mitigation is: pin the chat template in the tokeniser config, use the same template at training and serving, and test with the exact template used by the inference framework.
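
One cheap guard is to render the same probe message through the training and serving paths and compare the strings character by character. In the sketch below both render functions are hypothetical stand-ins; in practice they would call the training tokeniser and the inference framework respectively.

```python
def first_difference(a, b):
    """Index of the first differing character, or None if the strings are identical."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

def render_train(prompt):
    return f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"

def render_serve(prompt):
    # Hypothetical serving-side bug: one newline after the header instead of two.
    return f"<|start_header_id|>user<|end_header_id|>\n{prompt}<|eot_id|>"

probe = "ping"
a, b = render_train(probe), render_serve(probe)
pos = first_difference(a, b)
if pos is not None:
    print(f"templates diverge at char {pos}: {a[pos - 10:pos + 5]!r} vs {b[pos - 10:pos + 5]!r}")
```

Running a check like this in CI against the real serving stack catches the single-newline class of mismatch before it shows up as a quiet quality regression.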

Tokeniser mismatches. Adding special tokens (`<|im_start|>`, `<|tool_call|>`, etc.) to the chat template post-hoc requires resizing the embedding layer and the LM head, which LoRA does not target by default. The fix is to add `modules_to_save=['embed_tokens','lm_head']` to the LoRA config and fully fine-tune those layers alongside the LoRA adapter.

Pro: simplest, cheapest, most reliable form of post-training.
Pro: prerequisite for every preference method — universally needed.
Pro: scales from 1k LIMA-style runs to 10M+ Llama-scale runs with the same loss.
Pro: works with LoRA, QLoRA, full FT and every framework in the ecosystem.
Con: cannot teach response quality preferences — needs DPO / ORPO / KTO / GRPO on top.
Con: quality dominated by data; trainer choice is secondary.
Con: catastrophic forgetting risk on narrow-domain full FT; mitigated by LoRA or data mixing.
Con: chat-template fragility — train and serve must use identical templates.
Con: tokeniser changes (new special tokens) require resizing embeddings.

Practical implementation notes

Libraries that implement SFT well in 2026. TRL's SFTTrainer (huggingface/trl) is the canonical implementation — `from trl import SFTTrainer, SFTConfig`, point at a tokenised dataset and the model, and it runs. Axolotl (axolotl-ai-cloud/axolotl) wraps SFTTrainer with YAML config and the validation layer described in the Axolotl entry — recommended for any production SFT workload. Unsloth (unslothai/unsloth) wraps SFTTrainer with custom Triton kernels and is the throughput leader on single-GPU; see the Unsloth entry. LLaMA-Factory (hiyouga/LLaMA-Factory) wraps SFTTrainer with a Gradio UI. PEFT (huggingface/peft) provides the LoRA layer that the LoRA SFT and QLoRA SFT variants build on. Hugging Face Accelerate handles distributed launching across DDP, DeepSpeed and FSDP.

Hyperparameter defaults for instruction-tuning a 7-70B model with LoRA SFT in 2026: LoRA r=32, alpha=64 (alpha=2*r convention), use_rslora=True, target every linear in attention + MLP, dropout 0 (or 0.05 for sub-10k datasets), learning rate 2e-4 with cosine decay to 1e-5 and 3% warmup, effective batch 64-128 via grad accumulation, 1-3 epochs, BF16 throughout, gradient checkpointing on, FlashAttention 2/3 on, sample packing on, NEFTune embedding noise alpha=5-15 (optional, small but consistent chat-fluency bump). For full FT, drop LR to 1e-5, raise weight decay to 0.01, and use DeepSpeed ZeRO-3 above 13B.
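
The LoRA defaults above can be written down as plain dictionaries for reference. The key names mirror argument names used by PEFT's `LoraConfig` and TRL's `SFTConfig`, but nothing here imports or validates against those libraries; the module names assume a Llama-style architecture, and the split of the effective batch into per-device size and accumulation steps is an illustrative choice. The 1e-5 learning-rate floor is not expressed here and would need a scheduler that supports a minimum rate.

```python
# Plain dictionaries only; check key names against your installed library versions.
LORA_DEFAULTS = {
    "r": 32,
    "lora_alpha": 64,            # alpha = 2 * r convention
    "use_rslora": True,
    "lora_dropout": 0.0,         # 0.05 for sub-10k datasets
    "target_modules": [          # assumption: Llama-style module names
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}
TRAINING_DEFAULTS = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 4,    # illustrative split of the effective batch
    "gradient_accumulation_steps": 16,
    "bf16": True,
    "gradient_checkpointing": True,
    "packing": True,
    "neftune_noise_alpha": 5,            # optional
}

def effective_batch(cfg, n_gpus=1):
    """Sequences contributing to each optimiser step."""
    return cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"] * n_gpus

batch = effective_batch(TRAINING_DEFAULTS)
```

Keeping the effective batch as a derived quantity makes it easy to hold it in the 64-128 range when the GPU count or per-device batch size changes.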

Data preparation. The most consequential pipeline step. (1) Source: human-annotated, synthetic (teacher-model-generated and rejection-sampled), or hybrid. (2) Deduplicate: exact-match first, then near-duplicate via MinHash or embedding similarity; near-duplicates inflate training-set size without adding signal. (3) Quality filter: reject responses that are too short, contain template errors, or fail format-specific validation (code that does not parse, math that does not check). (4) Length balance: if the dataset is dominated by short examples, sample packing helps; if it has a long tail of very long examples, cap sequence length to avoid disproportionate optimiser-step weight. (5) Diversity sample: for very large datasets, downsample over-represented task types and domains so the trained model generalises further. The LIMA paper's central finding points the same way: data quality and diversity matter more than volume, and a clean 10k-example dataset will usually produce a better SFT model than a noisy 1M-example one.
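
The deduplication step (2) can be illustrated with a small sketch. It uses exact Jaccard similarity over word trigrams rather than MinHash, which is an assumption made for readability: the pairwise comparison is quadratic and only suitable for small corpora, whereas MinHash approximates the same similarity at scale. The threshold of 0.8 is illustrative.

```python
def shingles(text, n=3):
    """Set of word n-grams for a lower-cased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def deduplicate(examples, threshold=0.8):
    """Drop exact duplicates, then near-duplicates by shingle Jaccard similarity."""
    kept, kept_shingles, seen = [], [], set()
    for text in examples:
        key = " ".join(text.lower().split())  # normalise case and whitespace
        if key in seen:
            continue
        seen.add(key)
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(sh)
    return kept

examples = [
    "Write a Python function that reverses a string.",
    "write a python function  that reverses a string.",
    "Write a Python function that reverses a string. Thanks",
    "Explain the difference between a list and a tuple.",
]
unique = deduplicate(examples)
```

The second example is removed as an exact match after normalisation and the third as a near-duplicate, leaving two distinct prompts.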

Common failure modes. (1) Training loss falls steadily but eval quality is poor: usually `train_on_inputs: true` has been left enabled, so loss flows on prompt tokens. Confirm prompt masking is active. (2) Loss plateaus from step 0: wrong chat template, wrong tokeniser, or `lora_target_modules` matching fewer modules than intended, which leaves the adapter much smaller than planned. Call `model.print_trainable_parameters()` to confirm. (3) Eval quality drops after epoch 1: overfitting. Drop to 1 epoch. (4) Chat quality fine in eval but broken in production: chat template mismatch between training and serving. Pin and verify. (5) Refusal rate too high: refusal collapse from over-weighted safety data. Audit dataset composition. (6) Special-token outputs missing: tokeniser not resized; add `modules_to_save: ['embed_tokens','lm_head']`.
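
Failure mode (1) is easy to detect before training starts by inspecting the labels of a few tokenised examples. The sketch below assumes the usual convention that masked positions carry the label -100; the helper names and the toy label lists are illustrative.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch cross-entropy by default

def masking_report(labels):
    """Count how many positions in one example actually carry loss."""
    total = len(labels)
    supervised = sum(1 for x in labels if x != IGNORE_INDEX)
    return {
        "tokens": total,
        "supervised": supervised,
        "supervised_fraction": supervised / total if total else 0.0,
    }

def looks_unmasked(labels):
    """True when every position carries loss, the signature of training on prompt tokens."""
    return bool(labels) and all(x != IGNORE_INDEX for x in labels)

labels_ok = [IGNORE_INDEX] * 12 + [311, 42, 9]   # 12 prompt tokens masked, 3 response tokens
labels_bad = [5] * 15                            # nothing masked
report = masking_report(labels_ok)
```

A supervised fraction of 1.0 across a whole batch of prompt-plus-response examples is a strong hint that prompt masking is off; a fraction of 0.0 means the response itself was masked and the example contributes no gradient.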

How Yobitel customers consume this. The Yobibyte FineTune resource on the Yobibyte platform exposes SFT as `method: sft` (the default if no method is specified) on its customer-facing API. Customers submit a job spec — base model, dataset reference, rank, learning rate, epochs, spend cap — and Yobibyte runs the equivalent SFT pipeline using Axolotl + Unsloth on the supported model families on Yobitel-managed H100 or H200 capacity in NCSC OFFICIAL-aligned UK and EU NeoCloud regions. The resulting adapter is registered with Yobibyte's multi-LoRA inference surface and callable through an OpenAI-compatible endpoint within minutes of training completing. Preference methods (`method: dpo`, `method: orpo`, etc.) on the same surface stack on top of the SFT artefact rather than replacing it; the customer is encouraged to run SFT first and a preference stage second when their dataset supports it.

If you have only (prompt, response) pairs, you can only run SFT. If you have (prompt, chosen, rejected) triples or scalar preference scores, run SFT first and DPO / ORPO / KTO second on top of the SFT artefact. The two-stage pipeline (SFT then preference) generally outperforms either stage alone on public instruction-following benchmarks.

Where SFT fits in the Yobitel stack

SFT is the default method on Yobibyte FineTune — the customer-facing fine-tune resource on the Yobibyte platform. The API treats `method: sft` as the default if no method is specified; the customer supplies a base, dataset reference and a handful of hyperparameters (rank, learning rate, epochs, spend cap) and receives an adapter directory plus an OpenAI-compatible endpoint backed by Yobibyte's multi-LoRA inference surface. Internally Yobibyte routes the job to Unsloth + TRL on single-GPU or Axolotl + DeepSpeed on multi-GPU, running on Yobitel-managed H100 or H200 capacity in NCSC OFFICIAL-aligned UK and EU NeoCloud regions.

Preference methods on the same Yobibyte FineTune surface (`method: dpo`, `method: orpo`, `method: kto`, `method: grpo`) stack on top of the SFT artefact rather than replacing it. The standard 2026 pipeline a customer follows on Yobibyte is: (1) submit an SFT job to produce a v1 adapter; (2) evaluate the v1 adapter on InferenceBench or a custom evaluation set; (3) curate a preference dataset of (prompt, chosen, rejected) triples — often by generating multiple candidates from the v1 adapter and judging them with a strong reference model; (4) submit a DPO / ORPO / KTO job pointed at the v1 adapter to produce a v2 adapter that incorporates preference signal. Each stage is billed per training token; the v2 adapter is hot-swap-served alongside the v1 from the same multi-LoRA replica.

For self-hosted teams, Yobitel NeoCloud rents the H100 and H200 SXM5 capacity the SFT job would otherwise run on; the same Axolotl YAML or Unsloth script runs identically on rented NeoCloud GPUs with the same NCSC OFFICIAL alignment the managed Yobibyte service uses. InferenceBench, Yobitel's public AI-model benchmark, evaluates SFT-trained adapters alongside the bases they derived from so customers can quantify the quality lift their fine-tune produced before committing it to production rollout — the data confirms the textbook claim that good SFT lifts task-specific performance well above the base while preserving general capability.

References

Training language models to follow instructions with human feedback (InstructGPT, Ouyang et al., 2022) · arXiv
LIMA: Less Is More for Alignment (Zhou et al., 2023) · arXiv / NeurIPS 2023
The Llama 3 Herd of Models · arXiv (Meta AI, 2024)
TRL — Transformer Reinforcement Learning (SFTTrainer) · GitHub
Tulu 3: Pushing Frontiers in Open Language Model Post-Training · arXiv (Allen AI, 2024)
