Direct Preference Optimisation (DPO)

TL;DR

Direct Preference Optimisation (Rafailov et al., NeurIPS 2023, arXiv:2305.18290) shows that the standard RLHF objective has a closed-form optimum that can be expressed as a supervised classification loss on preference pairs — eliminating the explicit reward model and the PPO loop entirely.
The DPO loss is `-log σ(β · [log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))])`: a Bradley-Terry classifier on the implicit reward defined by the policy's likelihood ratio against a frozen reference. Two models in memory (policy + frozen reference), one supervised optimiser, well-behaved gradients.
Cuts the post-training pipeline from four-model PPO RLHF (policy, reference, reward, value) to a two-model supervised fine-tune. Typically half the GPU memory and a small fraction of the wall-clock time, with no PPO instability.
The DPO family is a standard building block of open-weights alignment in 2026: the Llama 3 and Phi-3 technical reports both describe DPO stages, and the Qwen 2.5 report describes offline DPO followed by online GRPO. Successors include IPO (Azar et al., 2023), KTO (Ethayarajh et al., 2024), ORPO (Hong et al., 2024), SimPO (Meng et al., 2024).
Yobitel relevance: the Yobibyte FineTune resource exposes DPO and SimPO as managed methods alongside SFT and LoRA, so customer alignment runs on Yobitel NeoCloud H100 / H200 capacity inherit the same recipe the open-weights frontier uses — no PPO orchestration to operate.

Overview

Direct Preference Optimisation is the answer to a question that the original RLHF formulation never quite asked: do we actually need the reward model and the RL loop, or were they an unnecessary intermediate? Rafailov, Sharma, Mitchell, Manning, Ermon and Finn's 2023 NeurIPS paper showed — with a remarkably short derivation — that the answer is no. The RLHF objective has a closed-form optimal policy, that optimum can be substituted into the Bradley-Terry preference loss, and the resulting expression depends only on the policy's log-likelihoods of chosen and rejected responses against a frozen reference. There is no reward model and no PPO; the alignment problem becomes a supervised classification task on preference pairs.

The mechanical consequence is large. PPO-style RLHF holds four models in memory at training time — the policy being trained, a frozen reference, the reward model, and the value head — and runs a sample-score-update loop that is notoriously unstable. DPO holds two models — the policy and the frozen reference — and runs a single supervised optimiser over a fixed preference dataset. Memory drops by roughly half. Wall-clock training time drops to a small fraction. Reproducibility rises sharply because the loss has well-behaved gradients without the variance and clipping pathologies of policy-gradient methods. Teams without dedicated RL expertise can get DPO working on the first try.

By mid-2026 the DPO family is a standard part of open-weights post-training, often alongside online RL. The Llama 3 technical report (Meta, July 2024) describes DPO applied over several rounds, with fresh preference data collected from the latest models each round. The Qwen 2.5 technical report (Alibaba, 2024) describes an offline DPO stage followed by online RL with GRPO. The Phi-3 report (Microsoft, 2024) describes SFT followed by DPO. Not every family follows the pattern: the Gemma 2 report (Google, June 2024) describes RLHF against a reward model rather than a DPO-style loss, and Mistral has not published a detailed recipe for Mistral Large 2 (July 2024). Closed-frontier labs (OpenAI, Anthropic, Google DeepMind) are generally assumed to still use on-policy RL for the final pass on their flagship models, because the on-policy signal remains useful at the very top, and to use DPO-style methods in iteration cycles where wall-clock and stability matter more than the last point of quality.

This entry helps you understand DPO well enough to choose between DPO, PPO RLHF and the modern DPO variants (IPO, KTO, ORPO, SimPO) for an alignment workload — and to wire up the training loop in HuggingFace TRL or a similar library with the correct β, the correct reference handling, and the correct iterative-DPO scheme when you need it. For teams running alignment on the Yobibyte FineTune resource on Yobitel NeoCloud, DPO and SimPO are the two managed preference-optimisation methods on offer; the maths in this entry is what they implement under the hood.

How it works: the closed-form derivation, the loss, and the β regulariser

RLHF in its standard formulation has three stages. Stage 1 is supervised fine-tuning (SFT) of a base model on demonstration data to produce a reasonable initial policy π_ref. Stage 2 is reward modelling: collect preference pairs (prompt x, chosen response y_w, rejected response y_l), fit a reward model r_φ(x, y) under the Bradley-Terry assumption that human preference probability is σ(r(y_w) - r(y_l)). Stage 3 is policy optimisation: maximise E[r_φ(x, y)] - β * KL(π_θ || π_ref), where the KL term keeps the trained policy from drifting too far from the SFT reference and β controls how strongly. PPO is the workhorse optimiser for stage 3.

Rafailov et al. observed that the optimal policy for the stage-3 objective has a closed form: π*(y|x) = π_ref(y|x) * exp(r(x, y) / β) / Z(x), where Z(x) = Σ_y π_ref(y|x) * exp(r(x, y) / β) is a partition function that depends only on x and not on y. Inverting this gives r(x, y) = β * log(π*(y|x) / π_ref(y|x)) + β * log Z(x). Substituting this implicit reward into the Bradley-Terry preference loss, the β * log Z(x) terms cancel (they appear identically in the chosen and rejected terms and subtract out), leaving a loss that depends only on the policy, the frozen reference, and the preference data.
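The derivation above can be written out step by step. This is a sketch in the entry's own notation, with the x-only term written explicitly as β log Z(x); the dataset symbol 𝒟 is an assumed name that does not appear elsewhere in this entry.

```latex
\begin{align}
\text{Objective:}\quad
  & \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
    \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
\text{Optimum:}\quad
  & \pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big),
    \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big) \\
\text{Inverted:}\quad
  & r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
\text{Bradley-Terry:}\quad
  & p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
    = \sigma\!\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                  - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)
\end{align}
```

The β log Z(x) term appears once with each sign in the last line, which is why no partition function has to be estimated. Replacing π* with the trainable π_θ and taking the negative log-likelihood over the preference data gives the DPO loss.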

The resulting DPO loss is a single equation. For a preference triple (x, y_w, y_l), that is prompt, chosen response and rejected response, the loss is L_DPO = -log σ(β * [log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))]). Each log-likelihood ratio, scaled by β, plays the role of an implicit reward: the policy is encouraged to raise the likelihood of y_w relative to y_l, measured against what the reference model assigned. The β term controls regularisation strength, the same role it played in the RLHF objective, but here it appears inside the sigmoid as a scale on the implicit reward margin.

The β value matters more than is sometimes appreciated. Small β (0.01-0.1) gives the policy room to drift from the reference, producing larger behaviour changes but risking distribution shift away from the SFT regime. Large β (0.5-1.0) keeps the policy close to the reference, producing conservative changes that may not capture the full preference signal. The Llama 3 technical report uses β = 0.1 for its DPO stage, and HuggingFace TRL defaults to β = 0.1. A poorly chosen β is a common source of disappointing DPO results, so it is worth sweeping alongside the learning rate.

The gradient of the DPO loss has a clean interpretation. Writing the implicit reward as r̂(x, y) = β * log(π_θ(y|x)/π_ref(y|x)), the gradient raises the log-likelihood of the chosen response and lowers that of the rejected response, with each example weighted by σ(r̂(x, y_l) - r̂(x, y_w)). The weight is large when the implicit reward ranks the pair wrongly (the rejected response scores higher than the chosen one) and small when the pair is already ranked correctly by a wide margin. This adaptive weighting is part of why DPO is stable in practice: examples the model has already learned contribute little gradient, and examples it has not learned contribute a lot.
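The per-example weighting can be checked numerically. The sketch below assumes PyTorch is available and uses three made-up log-ratio margins (`delta` is a hypothetical name for the chosen log-ratio minus the rejected log-ratio); it compares the autograd gradient with the analytic form -β · σ(-β · delta).

```python
import torch
import torch.nn.functional as F

beta = 0.1
# Hypothetical margins: badly mis-ranked, undecided, confidently correct.
delta = torch.tensor([-20.0, 0.0, 20.0], requires_grad=True)

# Per-example DPO loss as a function of the margin, summed so that each
# entry of delta.grad is that example's own gradient.
loss = -F.logsigmoid(beta * delta).sum()
loss.backward()

# Analytic gradient: d/d(delta) of -log sigmoid(beta * delta).
analytic = -beta * torch.sigmoid(-beta * delta.detach())
# The example weight is the sigmoid factor on its own.
weights = torch.sigmoid(-beta * delta.detach())

print("autograd:", delta.grad)
print("analytic:", analytic)
print("weights: ", weights)
```

The mis-ranked pair gets the largest weight and the confidently correct pair the smallest, which is the adaptive behaviour the paragraph describes.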

Input data: triples (prompt x, chosen response y_w, rejected response y_l). Same data as RLHF reward modelling.
Models in memory: policy π_θ (trainable) + reference π_ref (frozen SFT checkpoint). Two models, half the memory of PPO RLHF.
Loss: -log σ(β * [log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))]).
β: regularisation strength. Standard 0.1; smaller for more aggressive updates, larger for conservative.
Gradient: adaptive. Large when the implicit reward mis-ranks a pair, small when the pair is already ranked correctly by a wide margin. Stable training without PPO clipping.
Off-policy: trains on a fixed preference dataset; the policy is never sampled from during training. Cheap but loses the on-policy signal.

```python
# dpo_loss.py: DPO loss per Rafailov et al. 2023 (arXiv:2305.18290).
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,   # log π_θ(y_w | x): sum of response-token log-probs
    policy_logp_rejected: torch.Tensor, # log π_θ(y_l | x)
    ref_logp_chosen: torch.Tensor,      # log π_ref(y_w | x), no grad
    ref_logp_rejected: torch.Tensor,    # log π_ref(y_l | x), no grad
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO loss. policy_* require grad, ref_* are detached.

    Each per-example loss is -log sigmoid(beta * (logratio_chosen - logratio_rejected)).
    Only log-ratios against the reference enter the loss; the partition function
    Z(x) has already cancelled in the chosen-minus-rejected difference.
    """
    chosen_logratio   = policy_logp_chosen   - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Smoke test on synthetic logprobs.
torch.manual_seed(0)
policy_chosen   = torch.randn(8, requires_grad=True)
policy_rejected = torch.randn(8, requires_grad=True)
ref_chosen      = torch.randn(8)
ref_rejected    = torch.randn(8)
loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1)
loss.backward()
print("DPO loss:", loss.item())
print("policy_chosen grad mean:", policy_chosen.grad.mean().item())
# Expect: loss > 0; every entry of policy_chosen.grad is negative, so a
# gradient-descent step raises the chosen log-probs.
```

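Production stacks should use a maintained library such as HuggingFace TRL, whose `DPOTrainer` handles the reference model and the padding-aware aggregation of token log-probs into sequence log-probs. Reference-free SimPO and the iterative-DPO outer loop are not part of `DPOTrainer` itself: TRL has exposed SimPO through `CPOTrainer` with `loss_type="simpo"`, and the outer loop is left to the caller. Hand-rolling the loss is fine for understanding, but the wiring around it (padding-aware log-probs, gradient checkpointing, a reduced-precision reference) is where most bugs happen.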
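Padding-aware aggregation is one of the wiring details that most often goes wrong in custom code. A minimal sketch, assuming PyTorch and a causal language model whose logits at position t predict the token at position t+1; the function name `sequence_logps` and the `response_mask` convention (1 on response tokens, 0 on prompt and padding) are illustrative choices, not a library API.

```python
import torch

def sequence_logps(
    logits: torch.Tensor,         # (batch, seq_len, vocab) model outputs
    labels: torch.Tensor,         # (batch, seq_len) token ids
    response_mask: torch.Tensor,  # (batch, seq_len) 1 = response token, 0 = prompt/padding
) -> torch.Tensor:
    """Sum of token log-probs over response tokens only, one value per sequence."""
    # Position t predicts token t+1: drop the last logit and the first label.
    logits = logits[:, :-1, :]
    targets = labels[:, 1:]
    mask = response_mask[:, 1:].to(logits.dtype)

    # Log-prob assigned to each realised next token.
    token_logps = torch.log_softmax(logits, dim=-1)
    token_logps = token_logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Masked sum: prompt and padding positions contribute nothing.
    return (token_logps * mask).sum(dim=-1)

# Tiny check with uniform logits over a 5-token vocabulary.
logits = torch.zeros(1, 4, 5)
labels = torch.tensor([[1, 2, 3, 0]])         # last token is padding
response_mask = torch.tensor([[0, 1, 1, 0]])  # two response tokens
print(sequence_logps(logits, labels, response_mask))
```

With uniform logits every token has log-probability -log 5, so the two response tokens sum to -2 log 5. The same function would be called four times per batch (policy and reference, chosen and rejected) to produce the inputs to the DPO loss.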

Variants and architectural choices: IPO, KTO, ORPO, SimPO and iterative DPO

DPO has spawned a family of preference-optimisation methods, each addressing a specific limitation of the original formulation. The shared template — closed-form loss on a preference-derived signal — is preserved; the differences are in what signal is used, what regularisation appears, and what data format is required.

IPO (Identity Preference Optimisation, Azar et al., 2023, arXiv:2310.12036) modifies the DPO loss to avoid an over-fitting pathology. When preferences are deterministic (one response always wins), the DPO loss keeps decreasing as the implicit-reward margin grows, so the margin can be driven towards infinity and the KL regularisation stops binding; in practice this shows up as the policy assigning vanishingly small probability to the rejected response. IPO replaces the log-sigmoid with a squared loss that pulls the log-ratio margin towards a finite target of 1/(2τ), where τ is its regularisation strength. Useful when preference data is noisy or when DPO training shows divergent margins.
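The difference between the two losses is easiest to see as a function of the margin. The sketch below is plain Python on a single pair; `margin` is a hypothetical name for the chosen log-ratio minus the rejected log-ratio, and τ = 0.1 is an illustrative value rather than a recommended setting.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_pair_loss(margin: float, beta: float = 0.1) -> float:
    # -log sigmoid(beta * margin) equals softplus(-beta * margin).
    return softplus(-beta * margin)

def ipo_pair_loss(margin: float, tau: float = 0.1) -> float:
    # Squared distance between the margin and the finite target 1 / (2 * tau).
    return (margin - 1.0 / (2.0 * tau)) ** 2

for margin in (5.0, 50.0, 500.0):
    print(margin, dpo_pair_loss(margin), ipo_pair_loss(margin))
```

The DPO loss keeps falling as the margin grows, so the optimiser is always rewarded for widening it. The IPO loss is smallest at the target (5 for τ = 0.1) and rises again beyond it, which is what bounds the margin.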

KTO (Kahneman-Tversky Optimisation, Ethayarajh et al., 2024, arXiv:2402.01306) drops the pair requirement entirely. Where DPO needs (chosen, rejected) pairs, KTO works with unpaired binary labels — 'this single response was good' or 'this single response was bad'. The loss uses a prospect-theory utility function with separate gain and loss coefficients. Useful when data collection is constrained to thumbs-up/thumbs-down feedback rather than pairwise comparisons (a common production telemetry shape).
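A per-example sketch of the KTO loss in plain Python. It assumes the caller supplies `kl_ref`, the non-negative reference point that the paper estimates from a batch-level KL term and treats as a constant; the parameter names and default coefficients are illustrative, not taken from any library.

```python
import math

def sigmoid(z: float) -> float:
    # tanh form avoids overflow for large |z|.
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def kto_example_loss(
    logratio: float,      # log π_θ(y|x) - log π_ref(y|x) for this single response
    desirable: bool,      # True for thumbs-up, False for thumbs-down
    kl_ref: float,        # reference point (assumed to be estimated elsewhere)
    beta: float = 0.1,
    lambda_d: float = 1.0,  # weight on desirable examples
    lambda_u: float = 1.0,  # weight on undesirable examples
) -> float:
    if desirable:
        # Gain side: utility rises as the response becomes more likely than the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (logratio - kl_ref)))
    # Loss side: utility rises as the response becomes less likely than the reference point.
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref - logratio)))

print(kto_example_loss(10.0, desirable=True, kl_ref=0.0))
print(kto_example_loss(10.0, desirable=False, kl_ref=0.0))
```

No pairing is needed: each labelled response contributes on its own, and the two λ coefficients let an imbalanced thumbs-up / thumbs-down dataset be reweighted.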

ORPO (Odds Ratio Preference Optimisation, Hong et al., 2024, arXiv:2403.07691) combines SFT and preference optimisation into a single training stage. The loss adds an odds-ratio penalty that favours the chosen response over the rejected one to the standard cross-entropy on the chosen response, removing the need for a separate SFT stage and for a reference model. Useful when you want to compress the post-training pipeline from SFT + DPO to a single fine-tune.
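The combined objective can be sketched for one pair in plain Python. It assumes length-averaged log-probabilities as inputs (so that exp of the input is a probability strictly below 1) and uses `lam` as a hypothetical name for the weight on the odds-ratio term.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_odds(avg_logp: float) -> float:
    # log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0.
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_pair_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    # SFT term: negative log-likelihood of the chosen response.
    nll = -avg_logp_chosen
    # Preference term: -log sigmoid of the log odds ratio, chosen over rejected.
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    penalty = softplus(-log_odds_ratio)
    return nll + lam * penalty

print(orpo_pair_loss(-0.5, -0.5))  # no preference learned yet
print(orpo_pair_loss(-0.5, -2.0))  # rejected response already less likely
```

Only the policy's own probabilities appear, which is why ORPO needs neither a reference model nor a prior SFT stage.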

SimPO (Simple Preference Optimisation, Meng et al., 2024, arXiv:2405.14734) removes the reference model entirely. Instead of the log-ratio against π_ref, SimPO uses length-normalised log-likelihoods directly: the chosen response's average per-token log-probability minus the rejected's, scaled by β and offset by a target margin γ. Memory drops from two models to one (no reference), and the SimPO paper reports that it outperforms DPO on chat benchmarks such as AlpacaEval 2 and Arena-Hard despite being simpler.
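A single-pair sketch of the SimPO loss in plain Python. The default β and γ below are illustrative placeholders; SimPO typically uses a much larger β than DPO because the per-token averages are small numbers.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def simpo_pair_loss(
    sum_logp_chosen: float, len_chosen: int,
    sum_logp_rejected: float, len_rejected: int,
    beta: float = 2.0,   # illustrative value
    gamma: float = 1.0,  # illustrative target margin
) -> float:
    # Length-normalised implicit rewards: average log-prob per token, no reference model.
    reward_chosen = beta * sum_logp_chosen / len_chosen
    reward_rejected = beta * sum_logp_rejected / len_rejected
    # -log sigmoid(reward gap - gamma): the gap must exceed gamma before the loss gets small.
    return softplus(-(reward_chosen - reward_rejected - gamma))

# Same per-token quality at different lengths gives the same loss.
print(simpo_pair_loss(-10.0, 10, -10.0, 10))
print(simpo_pair_loss(-20.0, 20, -10.0, 10))
```

Because both responses are scored per token, doubling a response's length at constant per-token likelihood leaves the loss unchanged, which is how SimPO sidesteps length bias.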

Iterative DPO is not a different loss but a different training schedule. Run DPO to produce policy π_1; treat π_1 as the new reference; collect new preferences by generating fresh completions from π_1 and labelling them with annotators or a reward model; run DPO again to produce π_2; repeat. The Llama 3 technical report describes six rounds of post-training, each ending in a DPO stage trained on preference data collected from the latest models. Iterative DPO closes much of the gap to PPO RLHF on tasks where the policy needs to drift far from the SFT reference.
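The schedule reduces to a short outer loop. The skeleton below is a sketch with hypothetical callables: `collect_pairs` and `train_dpo` stand in for whatever generation, labelling and training code a team already has, and the toy versions exist only to make the bookkeeping visible.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Policy = TypeVar("Policy")
Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)

def iterative_dpo(
    sft_policy: Policy,
    prompts: Sequence[str],
    collect_pairs: Callable[[Policy, Sequence[str]], List[Pair]],
    train_dpo: Callable[[Policy, Policy, List[Pair]], Policy],
    rounds: int = 3,
) -> List[Policy]:
    """Return every checkpoint, starting with the SFT policy."""
    reference = sft_policy
    checkpoints = [sft_policy]
    for _ in range(rounds):
        # Preferences come from the latest policy, not the original SFT model.
        pairs = collect_pairs(reference, prompts)
        # Initialise the new policy from the reference and keep a frozen copy as π_ref.
        policy = train_dpo(reference, reference, pairs)
        checkpoints.append(policy)
        # The freshly trained policy becomes the next round's reference.
        reference = policy
    return checkpoints

# Toy stand-ins: a "policy" is just a string naming its lineage.
def toy_collect(policy: str, prompts: Sequence[str]) -> List[Pair]:
    return [(p, f"good answer from {policy}", f"bad answer from {policy}") for p in prompts]

def toy_train(init: str, reference: str, pairs: List[Pair]) -> str:
    return f"{init}+dpo"

checkpoints = iterative_dpo("sft", ["q1", "q2"], toy_collect, toy_train, rounds=3)
print(checkpoints)
```

Returning the full checkpoint list makes it straightforward to record which reference each round was trained against.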

The trade-off table below summarises when each variant earns its place. Published open-weights recipes in this family include iterative DPO with β = 0.1 (Llama 3) and offline DPO followed by online RL (Qwen 2.5); ORPO and SimPO are the main single-stage and reference-free alternatives. The choice depends on data shape, compute budget, and how far the policy needs to drift from the SFT reference.

Method	Reference model?	Data format	Key idea	When to use
DPO (Rafailov 2023)	Yes (frozen π_ref)	Pairs (x, y_w, y_l)	Bradley-Terry on implicit reward β log(π/π_ref)	Default for paired preference data.
IPO (Azar 2023)	Yes	Pairs	Squared loss on margin (DPO uses sigmoid)	When DPO over-fits margins or data is noisy.
KTO (Ethayarajh 2024)	Yes	Unpaired binary (good/bad)	Prospect-theory utility on single responses	When data is thumbs-up/down, not pairs.
ORPO (Hong 2024)	No	Pairs (with SFT data)	SFT cross-entropy + odds-ratio penalty in one stage	When compressing SFT + DPO into one fine-tune.
SimPO (Meng 2024)	No	Pairs	Length-normalised log-prob difference, target margin γ	When memory matters or reference is not available.
Iterative DPO	Yes, updates per round	Pairs, regenerated per round	Re-run DPO with previous round's policy as new reference	When policy needs to drift far from SFT (Llama 3 recipe).
PPO RLHF (comparison)	Yes + reward + value models	Pairs (for RM); on-policy samples	Policy gradient with explicit reward model	When on-policy signal matters; closed-frontier labs.

Iterative DPO is the most common upgrade path from single-pass DPO in mid-2026. The number of rounds is a tuning choice rather than a fixed standard: the Llama 3 technical report describes six, and each extra round adds a full generate, label and train cycle, so monitor held-out evaluations per round and stop when they plateau.

Where it is used today: open-weights alignment in 2026

DPO and its variants are now a standard component of open-weights preference alignment. Several of the families that publish their post-training recipes include a DPO stage, sometimes combined with online RL; labs that have not published are believed to use it heavily in iteration even if their final pass uses PPO. The table at the end of this section summarises the published recipes for the major open-weights Instruct families.

Llama 3 Instruct (Meta; technical report July 2024) applies SFT and then DPO in each of six post-training rounds, with β = 0.1. Rejection sampling against a reward model is used to build the SFT data, while the DPO preference pairs come from human annotators comparing responses sampled from the latest models. A common on-policy alternative when human labels are scarce is best-versus-worst sampling: the policy generates N candidates per prompt, a reward model scores them, and the highest- and lowest-scoring candidates become the chosen and rejected responses.
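Building a preference pair from N scored candidates, in the best-versus-worst style described above, might look like the sketch below. The function name and the dictionary keys are illustrative (the keys match the common prompt / chosen / rejected triple format), and the scores are assumed to come from a reward model that is not shown.

```python
from typing import Dict, Optional, Sequence

def build_preference_pair(
    prompt: str,
    candidates: Sequence[str],
    scores: Sequence[float],
) -> Optional[Dict[str, str]]:
    """Pick the highest- and lowest-scoring candidates as (chosen, rejected)."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")

    best = max(range(len(scores)), key=lambda i: scores[i])
    worst = min(range(len(scores)), key=lambda i: scores[i])

    # If every candidate scores the same there is no preference signal; skip the prompt.
    if scores[best] == scores[worst]:
        return None
    return {"prompt": prompt, "chosen": candidates[best], "rejected": candidates[worst]}

pair = build_preference_pair(
    "Explain DPO in one sentence.",
    ["answer A", "answer B", "answer C"],
    [0.2, 0.9, -0.4],
)
print(pair)
```

Dropping prompts with tied scores avoids feeding the loss pairs whose label is arbitrary.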

Qwen 2.5 Instruct (Alibaba, September 2024) uses a multi-stage pipeline according to its technical report: SFT on demonstration data, then offline DPO on preference pairs, then online RL with GRPO against a reward model. The Qwen 3 report (2025) describes a four-stage pipeline built around SFT and RL rather than a DPO-family loss.

Phi-3 Instruct (Microsoft, April 2024) uses SFT followed by DPO. Mistral has not published a detailed post-training recipe for Mistral Large 2 (July 2024). Other families lean on RL instead: the Gemma 2 report (Google, June 2024) describes RLHF against a reward model, and the DeepSeek-V3 report (DeepSeek, December 2024) describes SFT followed by RL with GRPO.

Closed-frontier labs are believed to still use PPO RLHF for the final pass on their flagship models — OpenAI's reports on GPT-4 mention RLHF; Anthropic's reports on Claude describe RLAIF (RL with AI feedback) which is RL-based; Google DeepMind's Gemini technical paper describes RL post-training. The on-policy signal from PPO remains valuable for the last point of quality. But internal iteration cycles at these labs almost certainly use DPO and its variants for the wall-clock and stability wins.

On the Yobitel side, the Yobibyte FineTune resource exposes DPO and SimPO as managed preference-optimisation methods, alongside SFT and LoRA / QLoRA. Customers bring preference data in the standard (prompt, chosen, rejected) triple format; Yobibyte runs the fine-tune on Yobitel NeoCloud H100 / H200 capacity in a UK or EU sovereignty region, with the trained checkpoint published back into the customer's marketplace workspace. This means customer alignment runs inherit the same recipe the open-weights frontier uses, without operating a four-model PPO orchestration loop themselves.

Model	Released	Alignment method	β	Notes
Llama 3 / 3.1 Instruct	Jul 2024 (report)	SFT + DPO (six rounds)	0.1	Human-annotated preference pairs; rejection sampling for SFT data.
Qwen 2.5 Instruct	Sep 2024	SFT + offline DPO + online GRPO	see report	DPO is the offline stage only.
Qwen 3	2025	SFT + RL (four stages)	n/a	No DPO-family loss described in the report.
Mistral Large 2 Instruct	Jul 2024	Not published in detail	n/a	No public post-training recipe.
Phi-3 Instruct	Apr 2024	SFT + DPO	0.1	Synthetic preference pipeline.
Phi-4	Dec 2024	SFT + DPO (two rounds)	0.1	Synthetic + human pairs.
Gemma 2	Jun 2024	RLHF with a reward model	n/a	RL-based rather than a DPO-family loss.
DeepSeek-V3 (chat)	Dec 2024	SFT + GRPO	n/a	RL with rule-based and model-based rewards.

Trade-offs and known limitations

DPO's central trade is off-policy vs on-policy. PPO RLHF is on-policy: at every step the policy samples fresh responses, the reward model scores them, and the policy updates from its current behaviour distribution. DPO is off-policy: training runs over a fixed preference dataset that was collected before training started. The on-policy signal is richer because it explores the current policy's behaviour; the off-policy signal is cheaper but limited to the data distribution that produced the preferences. Empirically, single-pass DPO matches PPO on most chat benchmarks but lags on tasks where the policy needs to drift far from the SFT reference — advanced reasoning, hard refusal targets, multi-turn agent behaviour. Iterative DPO partially closes this gap by re-collecting preferences from the updated policy each round.

The reference-model dependency is the next constraint. Vanilla DPO requires a frozen reference π_ref to be available throughout training, which doubles the parameter memory at training time. For a 70B-parameter model fine-tune, that means two 140 GB BF16 parameter sets in HBM, plus optimiser state on the policy. ZeRO-3 / FSDP sharding handles this, but it is non-trivial. SimPO removes the reference entirely by using length-normalised log-probabilities as the implicit reward signal; ORPO removes it by combining SFT with the preference loss in a single objective. When training memory is the binding constraint, SimPO or ORPO is the right pick.

Margin over-fitting is the well-known DPO pathology. The loss -log σ(·) approaches zero only as the implicit-reward margin goes to infinity, so on pairs with a consistent winner the optimiser keeps widening the margin, even though the per-example gradient shrinks as the sigmoid saturates. The cheapest way to widen the margin is often to push the rejected response's likelihood towards zero, and the chosen response's likelihood can fall along with it, so the policy drifts from the reference further than β suggests. The standard cures are early stopping (monitor the margin distribution; stopping once the median margin exceeds about 5 is a rough rule of thumb), IPO's squared-loss replacement, or DPO + SFT mixing (continue applying a small SFT cross-entropy term on the chosen responses throughout).
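Margin monitoring for early stopping can be as simple as the sketch below. It assumes the margin being tracked is the β-scaled implicit-reward gap and takes the threshold of 5 from the rule of thumb above; both the function names and that interpretation are assumptions to adjust for a given training stack.

```python
import statistics
from typing import Dict, Sequence

def margin_stats(
    chosen_logratios: Sequence[float],    # log π_θ(y_w|x) - log π_ref(y_w|x) per example
    rejected_logratios: Sequence[float],  # log π_θ(y_l|x) - log π_ref(y_l|x) per example
    beta: float = 0.1,
) -> Dict[str, float]:
    margins = [beta * (c - r) for c, r in zip(chosen_logratios, rejected_logratios)]
    return {
        "median_margin": statistics.median(margins),
        # Fraction of pairs the implicit reward ranks correctly.
        "accuracy": sum(m > 0 for m in margins) / len(margins),
    }

def should_stop(stats: Dict[str, float], max_median_margin: float = 5.0) -> bool:
    # Stop once the typical pair is separated by more than the chosen threshold.
    return stats["median_margin"] > max_median_margin

stats = margin_stats([60.0, 80.0, 10.0], [0.0, 10.0, 20.0], beta=0.1)
print(stats, should_stop(stats))
```

Logging the chosen and rejected log-ratios separately, not just their difference, also shows whether the margin is growing because the chosen likelihood rises or only because the rejected likelihood collapses.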

Length bias is another well-known DPO failure mode. Because the sequence log-probability is a sum over tokens, DPO can learn to game the implicit reward by producing longer or shorter responses (depending on which direction the preference data is biased toward). SimPO's per-token normalisation handles this directly; with standard DPO, common mitigations are to balance response lengths in the preference data or to add an explicit length normalisation or length penalty to the loss.

Distribution shift relative to the reference is a real and frequently underappreciated limit. DPO's derivation assumes that preferences follow a Bradley-Terry model and that the optimum is reachable from π_ref under the β-weighted KL constraint; in practice it also works best when the preference pairs look like samples from π_ref. If the desired behaviour is far outside the reference's distribution, DPO can fail to reach it: both responses have very low reference likelihood and the log-ratios become a noisy signal. Iterative DPO addresses this by updating the reference per round; for very large behaviour changes (e.g., aligning a base model directly to a chat persona without intermediate SFT), DPO is the wrong tool: SFT first, then DPO.

The data-collection cost has not gone away. DPO eliminates the reward model and the RL loop, but the preference pairs themselves still cost the same to collect. Human annotators are still the bottleneck for high-quality alignment data; synthetic preferences (one model judging another's outputs) are cheaper but introduce judge-model biases. The economic equation for alignment data is the same as it was under PPO RLHF.

Practical implementation notes

Libraries that implement DPO well in 2026: HuggingFace TRL is the de facto standard. Its `DPOTrainer` covers DPO and selects related pairwise losses such as IPO through the `loss_type` setting, while KTO, ORPO and CPO (which also hosts SimPO) have been provided as separate trainers (`KTOTrainer`, `ORPOTrainer`, `CPOTrainer`); check the documentation for the TRL version you install, since trainer locations have changed between releases. PyTorch + Accelerate is the substrate underneath for distributed training; DeepSpeed-Chat covers the same ground with a ZeRO-aware reference-model implementation; Axolotl exposes DPO as a YAML-configured fine-tune recipe popular in the open-weights community; the LLaMA-Factory and OpenRLHF libraries cover the iterative-DPO outer loop.

Standard recipe for a single-pass DPO fine-tune. Start from an SFT checkpoint: DPO is post-training, not from-scratch alignment. Use LoRA or QLoRA on the policy if memory is tight (full fine-tuning is preferable when budget allows); a LoRA rank of around 16-32 applied to the attention and FFN projections is a reasonable starting point. Set β = 0.1. Use a learning rate of 5e-7 to 5e-6 for full fine-tuning, well below typical SFT rates, because DPO over-fits quickly at higher values. Train for one to two epochs over the preference dataset. Monitor the chosen-vs-rejected reward gap (the implicit reward margin): it should rise during training and plateau; if it diverges to very large values, apply early stopping or switch to IPO.

Common gotchas.
Reference gradients: the reference contribution to the loss must carry no gradient. If the reference parameters are trainable and in the optimiser, they are updated too and the KL anchor is lost.
Sign convention: the margin is y_w (chosen) minus y_l (rejected). Inverting it produces a loss that pushes the policy towards the rejected responses.
Log-prob aggregation: DPO operates on sequence-level log-likelihoods (the sum of token log-probs over the response, ignoring prompt and padding). HuggingFace TRL handles this automatically, but custom implementations often get it wrong.
Precision mismatch: if the reference is loaded in BF16 and the policy in FP32, the two sets of log-probabilities carry different rounding error and the implicit reward signal becomes noisy.

Iterative DPO operational pattern. Round 1: train DPO on preference pairs with the SFT checkpoint as reference. Round 2: copy the round-1 policy as the new reference; generate fresh completions from it and run a new preference-collection pass (human annotation, or best-versus-worst sampling against a reward model). Train round-2 DPO. Repeat for further rounds. Losing track of which checkpoint is the current reference is the most common bug, so record the reference checkpoint alongside each round's dataset and outputs.

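Sizing arithmetic for a planning conversation on Yobitel NeoCloud. A full-parameter single-pass DPO fine-tune of a 70B model in BF16 needs roughly 280 GB for weights (140 GB policy + 140 GB frozen reference), about 140 GB for BF16 gradients on the policy, and AdamW state on top: about 280 GB if both moments are kept in BF16, or about 560 GB in FP32, before any FP32 master weights or activations. That totals roughly 700 GB to 980 GB, which exceeds the 640 GB of HBM on one 8x H100 SXM5 node even with ZeRO-3 / FSDP sharding, so a full-parameter 70B run needs multiple nodes, CPU offload of optimiser state, or a lower-precision optimiser. With LoRA the base weights still have to be loaded, but gradients and optimiser state shrink to the adapter parameters (typically well under 1 % of the model), and the base model with adapters disabled can serve as the reference, so the footprint falls to roughly the 140 GB of base weights plus activations and fits on a single node. As a rough planning figure, a preference dataset of 50,000 pairs trains in 6-12 hours on a single 8x H100 node for a 70B model. Three-round iterative DPO roughly triples training time, plus the cost of generating and labelling new pairs each round, but produces materially better alignment quality. The Yobibyte FineTune resource runs this recipe transparently: customers submit preference data, choose DPO or SimPO, and receive a trained checkpoint back, without operating the training cluster themselves.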
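The sizing arithmetic can be captured in a small calculator. This is a back-of-envelope sketch in plain Python that ignores activations, buffers and fragmentation; the function name, the byte counts per parameter and the 1 % trainable fraction used for the LoRA case are illustrative assumptions.

```python
from typing import Dict

def dpo_memory_gb(
    params_billions: float,
    weight_bytes: float = 2.0,           # BF16 weights
    grad_bytes: float = 2.0,             # BF16 gradients
    optim_bytes_per_param: float = 8.0,  # AdamW: two FP32 moments
    trainable_fraction: float = 1.0,     # 1.0 = full fine-tune; small for LoRA
    separate_reference: bool = True,     # False when the adapter-disabled base acts as π_ref
) -> Dict[str, float]:
    """Static memory in GB (1e9 parameters x bytes per parameter = GB)."""
    policy = params_billions * weight_bytes
    reference = params_billions * weight_bytes if separate_reference else 0.0
    trainable = params_billions * trainable_fraction
    grads = trainable * grad_bytes
    optim = trainable * optim_bytes_per_param
    return {
        "policy": policy,
        "reference": reference,
        "grads": grads,
        "optimizer": optim,
        "total": policy + reference + grads + optim,
    }

full = dpo_memory_gb(70)
lora = dpo_memory_gb(70, trainable_fraction=0.01, separate_reference=False)
node_hbm_gb = 8 * 80  # one 8x H100 node
print("full fine-tune:", full["total"], "GB; fits one node:", full["total"] < node_hbm_gb)
print("LoRA:          ", lora["total"], "GB; fits one node:", lora["total"] < node_hbm_gb)
```

Under these assumptions the full fine-tune needs about 980 GB before activations and the LoRA run about 147 GB, which is the gap that decides between a multi-node job and a single node.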

Evaluation discipline matters as much as the loss function. The standard chat-quality benchmarks for DPO are MT-Bench, AlpacaEval 2 and Arena-Hard; for reasoning, GSM8K and MATH; for refusal, the Anthropic HH and BeaverTails datasets. Always evaluate both the SFT baseline and the DPO output on the same suite — a DPO run that improves MT-Bench but regresses on GSM8K is a real and common pattern, and only shows up under broad evaluation. The Yobibyte FineTune resource produces an evaluation report against the same benchmarks the Yobitel InferenceBench leaderboard uses, so DPO-trained customer models can be compared like-for-like with the marketplace baselines.

A common and easily missed DPO bug is silently updating the reference model. Run the reference forward pass under `torch.no_grad()`, set `requires_grad=False` on its parameters, and keep them out of the optimiser; otherwise the optimiser can lower the loss by moving the reference instead of the policy, and the KL anchor is lost. HuggingFace TRL handles this correctly; custom implementations often do not.
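A quick way to guard against this in custom code is to assert that the reference is untouched after an optimiser step. The sketch below assumes PyTorch and uses a tiny `nn.Linear` as a stand-in for both models, with a placeholder loss that is not the DPO loss; only the freezing pattern is the point.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
policy = nn.Linear(4, 3)           # stand-in for the trainable policy
reference = copy.deepcopy(policy)  # stand-in for the frozen SFT reference

# Freeze the reference: eval mode, no gradients, and never handed to the optimiser.
reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-3)

ref_before = [p.clone() for p in reference.parameters()]
policy_before = [p.detach().clone() for p in policy.parameters()]

x = torch.randn(2, 4)
with torch.no_grad():              # reference forward pass builds no graph
    ref_out = reference(x)
loss = (policy(x) - ref_out).sum() # placeholder loss with a non-zero policy gradient
loss.backward()
optimizer.step()

ref_unchanged = all(torch.equal(b, p) for b, p in zip(ref_before, reference.parameters()))
policy_changed = any(not torch.equal(b, p) for b, p in zip(policy_before, policy.parameters()))
print("reference unchanged:", ref_unchanged, "| policy changed:", policy_changed)
```

The same two checks can run once at the start of a real training job, which is cheap insurance against a reference that drifts unnoticed.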

Where DPO fits in the Yobitel stack

Direct Preference Optimisation and SimPO are the two preference-alignment methods exposed by the Yobibyte FineTune resource. Customers bring preference data in the standard (prompt, chosen, rejected) triple format — the same data shape they would have collected for PPO RLHF — and Yobibyte runs the fine-tune on Yobitel NeoCloud H100 / H200 capacity in a chosen sovereignty region (UK NCSC OFFICIAL, EU Data Boundary, US FedRAMP-equivalent). The trained checkpoint is published back into the customer's marketplace workspace as a private model, served through the same OpenAI-compatible endpoint as the base catalogue models.

The alignment recipe Yobibyte uses follows the open-weights frontier: SFT first (if the customer is starting from a base model rather than an Instruct checkpoint), then DPO at β = 0.1, with iterative DPO available as an opt-in for customers who need the additional behaviour drift the iterative scheme provides. SimPO is offered as an alternative when memory budget or reference-availability matters. The recipe choice is exposed in the FineTune configuration; the underlying scheduler reasons about cluster placement and ZeRO-3 / FSDP sharding.

Yobitel NeoCloud — the H100 SXM5 / H200 fleet — sizes per-run GPU allocation using the same sizing arithmetic in this entry. A 70B BF16 single-pass DPO run lands on a single 8x H100 SXM5 node; a 405B DPO run lands on two nodes with InfiniBand NDR between them; LoRA-only runs share nodes across multiple customers. The published per-GPU-hour price-list maps directly into a budget for a DPO fine-tune of any size.

Yobitel InferenceBench publishes evaluation results for the Instruct variants of the major open-weights families on the same MT-Bench / AlpacaEval / Arena-Hard suite that Yobibyte's FineTune evaluation report uses. For teams wanting to verify their DPO-trained checkpoint against the published baselines, InferenceBench is the empirical anchor — a customer can compare their fine-tuned Llama 3.1 70B Instruct against the public Llama 3.1 70B Instruct on the same metrics with no methodology gap.
