TL;DR
- RLHF aligns a language model with human preferences by (1) supervised fine-tuning on instruction-response pairs, (2) training a reward model on pairwise human preferences, then (3) reinforcement-learning the policy to maximise reward under a KL penalty against the reference model.
- Foundations: Christiano et al. 2017 (arXiv:1706.03741) for the preference-learning framework; Stiennon et al. 2020 for summarisation; InstructGPT (Ouyang et al. 2022, arXiv:2203.02155) for the canonical LLM pipeline that became ChatGPT.
- Every aligned chat model in production — GPT-4o, Claude Sonnet 4, Gemini 2.5, Llama 3 Instruct, Qwen 3, Mistral Large Instruct, Gemma 3 Instruct — uses RLHF or one of its lineal descendants (DPO, RLAIF, GRPO, RLVR).
- Failure modes are well-catalogued and predictable: reward hacking, mode collapse, sycophancy, length bias, alignment tax. Mitigations (KL penalty tuning, reward-model auditing, ensemble RMs, iterative relabelling) are now standard practice.
- By mid-2026 the post-training stack is rarely 'just PPO'; it is typically SFT → DPO (preference data) → GRPO (verifiable rewards) → final PPO or DPO polish, with RLAIF / constitutional AI used to scale label generation.
Overview
A model pretrained only on next-token prediction over the open web is a fluent text continuer, not a useful assistant. It will continue a question with another plausible question, complete a coding prompt with the comment style of its training data rather than working code, refuse nothing, and confidently produce harmful content if asked. The pretraining objective optimises for matching the distribution of training data; what users want is matching their intent. The gap between those two things is what RLHF exists to close.
RLHF frames alignment as preference learning. Rather than asking humans to specify the right answer (which is expensive, inconsistent and often subjective), ask them which of two candidate answers is better. Train a model to predict those preferences. Use that model as a reward signal to fine-tune the language model with reinforcement learning, while a KL-divergence penalty against the SFT reference policy stops the optimiser from drifting into incoherent text just to game the reward.
The technique came from a different field. Deep RL from human preferences (Christiano, Leike, et al. 2017) was developed for robotics and Atari, where specifying a reward function for 'do a backflip' is harder than judging which of two attempts looked more like one. Learning to Summarise from Human Feedback (Stiennon et al. 2020) ported it to language. InstructGPT (Ouyang et al. 2022) operationalised the pipeline at GPT-3 scale, ChatGPT shipped the result publicly in November 2022, and within twelve months every credible frontier lab had built its own RLHF stack.
This entry is a reference for operators and applied researchers who need to understand the full RLHF pipeline: the three stages, what the reward-model loss is doing, what PPO actually optimises in this setting, the failure modes that bite teams in production, and the 2026 landscape of variants and successors (DPO, GRPO, RLAIF, RLVR). The aim is to help you pick the right post-training algorithm for your data (DPO for preference pairs, GRPO for verifiable rewards, full PPO only when you genuinely need it), size the compute footprint a frontier alignment run actually needs, and recognise reward hacking before it ships. If you are training aligned models on Yobitel NeoCloud or consuming RLHF-trained open weights through Yobibyte, this matters for two reasons: RLHF and DPO runs are among the larger H100 / H200 / B200 jobs that land on NeoCloud, and every Instruct model in the Yobibyte catalogue (Llama 3.1 Instruct, Qwen 3 Instruct, Mistral Large 2 Instruct) was produced by some variant of the pipeline below.
How it works: the three-stage InstructGPT pipeline
InstructGPT codified the pipeline that every subsequent RLHF system has followed in some form. The starting point is a pretrained base model — a next-token predictor with no instruction-following ability. The output is an aligned policy that responds usefully and refuses appropriately. The path between is three sequential stages.
Stage 1 — Supervised Fine-Tuning (SFT). Collect a high-quality dataset of prompt-response pairs written by human labellers (typically ~10,000-100,000 examples for frontier RLHF). Fine-tune the pretrained base model on this dataset using the standard cross-entropy next-token loss. The SFT model can now follow instructions in roughly the right format. It is not yet well-aligned with the full preference distribution, but it is in the right region of policy space for RL.
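As a rough illustration of the Stage 1 objective, the sketch below computes the mean next-token cross-entropy from per-token log-probabilities. Restricting the loss to response tokens (masking out the prompt) is a common convention assumed here, not something the pipeline description specifies; `sft_loss` and its argument names are hypothetical.

```python
import math


def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over response tokens.

    token_logprobs    : log p(token_t | tokens_<t) for each position
    is_response_token : True where the token belongs to the labeller-written
                        response (assumed convention: prompt tokens are masked)
    """
    picked = [lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return -sum(picked) / len(picked)


# Toy example: one prompt token followed by two response tokens.
logps = [math.log(0.9), math.log(0.5), math.log(0.25)]
mask = [False, True, True]
loss = sft_loss(logps, mask)
```

Only the two response tokens contribute, so the prompt token's probability has no effect on the loss in this sketch.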
Stage 2 — Reward Model (RM). For each of many prompts (~10,000-1,000,000), sample K candidate responses from the SFT model (typically K = 4-9). Have human labellers rank the K responses, producing K·(K-1)/2 pairwise comparisons per prompt. Train a separate Transformer (same architecture as the policy, initialised from the SFT checkpoint, with a scalar regression head replacing the LM head) to score responses such that preferred responses score higher than rejected ones, under the Bradley-Terry pairwise loss.
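To make the K·(K-1)/2 count concrete, here is a minimal sketch of how one labeller ranking could be expanded into (chosen, rejected) training pairs. It assumes a strict best-first ranking with no ties; `ranking_to_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand one ranking (best first) into (chosen, rejected) pairs.

    Assumes a strict ranking with no ties, so K responses yield
    K * (K - 1) / 2 pairs. combinations() keeps input order, which means
    the first element of every pair is the higher-ranked response.
    """
    return list(combinations(ranked_responses, 2))


pairs = ranking_to_pairs(["A", "B", "C", "D"])  # K = 4
```

With K = 4 this produces six pairs. Because all pairs from one prompt share the same responses, they are strongly correlated, so it is usually safer to keep them in the same batch than to shuffle them across the dataset as if they were independent examples.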
Stage 3 — RL fine-tuning. Treat the SFT model as the initial policy π_θ. For each prompt, sample a response from the current policy, score it with the reward model, and update the policy with PPO to increase expected reward, subject to a KL-divergence penalty against the frozen SFT policy that prevents drift into reward-hacking gibberish. The objective, which PPO maximises (its negation is the loss), is J(θ) = E[r̂(p, y) − β · KL(π_θ(·|p) || π_SFT(·|p))], with β typically 0.01-0.1.
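One common way to implement the KL-penalised objective is to fold the penalty into per-token rewards, using the sampled log-probability ratio as the KL estimate. The sketch below shows that bookkeeping under those assumptions; `kl_shaped_rewards` is a hypothetical name, and crediting the reward-model score to the final token is a convention, not the only option.

```python
def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.02):
    """Per-token rewards whose sum is r_hat - beta * (sampled KL estimate).

    rm_score    : scalar reward-model score for the whole response
    logp_policy : log pi_theta(token_t | ...) for each sampled token
    logp_ref    : log pi_SFT(token_t | ...) for the same tokens
    """
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("need equal-length, non-empty log-prob lists")
    # Each token pays a penalty proportional to its log-probability ratio.
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # The RM scores the full response, so its score lands on the last token.
    rewards[-1] += rm_score
    return rewards


rewards = kl_shaped_rewards(
    rm_score=1.5,
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.2, -0.9, -2.0],
    beta=0.1,
)
```

Summing the list recovers the sequence-level quantity inside the expectation: the reward-model score minus β times the summed log-ratio.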
Four models live in memory simultaneously during stage 3: the active policy being optimised, the frozen reference (SFT) policy for the KL term, the frozen reward model, and the value (critic) network used by PPO for advantage estimation. This is the source of RLHF's notorious operational cost and why GRPO (which drops the value network) and DPO (which drops the reward model and PPO entirely) have become popular alternatives.
```python
# rlhf_components.py: the three loss terms at the heart of RLHF.
# Conceptual; production code uses TRL, OpenRLHF, verl or LLM-RLHF frameworks.
import torch
import torch.nn.functional as F


# ---------- Stage 2: Reward Model ----------
def rm_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss.

    reward_chosen / reward_rejected: scalar rewards for the preferred and
    rejected responses to the same prompt, shape (batch,).
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# ---------- Stage 3: PPO update for RLHF ----------
def rlhf_ppo_loss(logp_new, logp_old, advantage, kl_to_ref,
                  clip_eps=0.2, beta_kl=0.02):
    """Single-token PPO update with KL penalty against reference policy.

    logp_new : log pi_theta(token | prompt + prefix), current policy
    logp_old : log pi_theta_old(token | ...), policy that generated rollout
    advantage: GAE advantage from the value network
    kl_to_ref: KL(pi_theta || pi_ref) at this token
    """
    ratio = (logp_new - logp_old).exp()
    surrogate_1 = ratio * advantage
    surrogate_2 = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(surrogate_1, surrogate_2)
    return (policy_loss + beta_kl * kl_to_ref).mean()


# ---------- Putting it together for a single rollout ----------
# For each prompt p:
#   y        = policy_old.generate(p)            # rollout from snapshot policy
#   r        = reward_model(p, y)                # learned reward
#   v        = value_model(p, y_prefix)          # critic
#   adv      = compute_gae(r, v, lam=0.95)       # advantage estimation
#   logp_new = policy.log_prob(y_token | p, y_prefix)
#   logp_old = policy_old.log_prob(y_token | p, y_prefix)
#   kl       = (logp_new - log_ref(y_token | p, y_prefix))
#   loss     = rlhf_ppo_loss(logp_new, logp_old, adv, kl)
```

Practically every PPO-RLHF failure traces back to KL coefficient tuning or reward-model overconfidence. If β is too low (e.g. 0.001), the policy collapses into reward hacking: gibberish or repetition that exploits a high-scoring artefact of the RM. If β is too high (e.g. 0.3), the policy barely moves and the alignment improvement is marginal. Start at β = 0.02, monitor the running KL divergence as a leading indicator, and adjust before chasing other hyperparameters.
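Adjusting β by hand can be replaced with a simple proportional controller. The sketch below is modelled on the adaptive KL scheme described by Ziegler et al. (2019) and is offered as one option, not as the method this entry prescribes. The `target_kl` and `horizon` defaults are illustrative assumptions, and the class is a standalone toy rather than any library's API.

```python
class AdaptiveKLController:
    """Proportional controller that nudges beta toward a target KL.

    target_kl and horizon are illustrative defaults (assumptions), not
    values taken from this entry.
    """

    def __init__(self, init_beta=0.02, target_kl=6.0, horizon=10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_samples):
        # Relative error, clipped so one noisy batch cannot swing beta far.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        # Larger batches count for more; horizon sets the adaptation speed.
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta


ctl = AdaptiveKLController(init_beta=0.02, target_kl=6.0, horizon=10_000)
beta_after_high = ctl.update(observed_kl=12.0, n_samples=256)  # KL above target
beta_after_low = ctl.update(observed_kl=3.0, n_samples=256)    # KL below target
```

When the measured KL overshoots the target, β rises and pulls the policy back toward the reference; when it undershoots, β falls and lets the policy move.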
Variants and successors: the modern post-training landscape
By mid-2026 the post-training stack is rarely 'pure InstructGPT-style PPO RLHF'. Different stages of the pipeline use different algorithms, depending on whether the reward is preference data or a verifiable signal, whether labels come from humans or an AI labeller, and how much PPO instability the team is willing to manage. The table below maps the variants you actually meet in production.
- DPO (Rafailov et al. 2023, arXiv:2305.18290) re-derives the RLHF optimum as a closed-form supervised loss on (preferred, rejected) pairs, eliminating the reward model and PPO loop. Two models in memory instead of four, dramatically simpler, more stable. Now the default for open-model preference-based post-training.
- GRPO (DeepSeek 2024, arXiv:2402.03300) replaces PPO's learned value network with a group baseline: sample G responses to the same prompt, advantage_i = (r_i − mean(r)) / std(r). Cuts one model from memory and behaves better with sparse rewards. Now standard for verifiable-reward RL on math and code.
- RLAIF (Bai et al. 2022 'Constitutional AI', Lee et al. 2023 arXiv:2309.00267) replaces human labellers with an AI labeller that scores responses against a written constitution. Cuts labelling cost by 10-100x and makes the alignment behaviour explicitly governed by an auditable document. Used heavily by Anthropic (Claude) and increasingly by open frontier labs.
- RLVR / RL from verifiable rewards is what produced OpenAI's o1 reasoning models and DeepSeek-R1. The reward is programmatic — math solution checker, code unit tests, agent task success — so reward hacking on the model side is harder. Combined with GRPO it has produced step-changes in math and coding benchmarks since late 2024.
- Constitutional AI (Bai et al. 2022, arXiv:2212.08073) is Anthropic's specific instantiation: SL-CAI (the model critiques and revises its own outputs against the constitution) feeds RL-CAI (AI-generated preferences against the constitution drive PPO). The detailed mechanics are in the constitutional-ai entry.
- Online / iterative DPO addresses one of DPO's structural limitations — being off-policy — by alternating policy update with fresh preference labelling. Llama 3's report describes iterative DPO; Tülu 3 (AI2, 2024) demonstrated it carefully at moderate scale.
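For reference, the DPO loss mentioned in the list above can be written in a few lines. This is a minimal single-pair sketch in plain Python, assuming summed sequence log-probabilities are already available; β = 0.1 is an illustrative default and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response:
    logp_* under the policy being trained, ref_logp_* under the frozen
    reference. Only these two models are needed.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Policy identical to the reference: the loss starts at log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen response's likelihood: the loss falls.
loss_after_update = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

The loss depends only on how far the policy has moved relative to the reference on each response, which is why no separate reward model or rollout loop appears.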
| Variant | Where the reward comes from | How it optimises | Used by |
|---|---|---|---|
| PPO RLHF (classic) | Learned RM from human pairwise prefs | PPO with value network, KL penalty | InstructGPT, Llama 2-Chat, early Claude |
| DPO | Same human pairwise prefs, no RM | Closed-form supervised loss on policy vs reference logprobs | Llama 3 Instruct, Qwen 3, Mistral, Gemma 3 |
| KTO | Unpaired good/bad labels (no pairs required) | Prospect-theory supervised loss | Some open-source SFT-only deployments |
| GRPO | Verifiable rewards (math correctness, unit tests) | Group-relative advantage, no value network | DeepSeek-R1, DeepSeekMath, Qwen-Math, many reasoning fine-tunes |
| RLAIF | AI labeller scores responses against a written constitution | Otherwise same as PPO or DPO | Claude family, Llama 3 (partial), heavy in 2026 open models |
| RLVR (RL from Verifiable Rewards) | Programmatic checker (math solver, unit tests, agent task success) | GRPO or PPO | OpenAI o1, DeepSeek-R1, reasoning models broadly |
| Online / iterative DPO | Fresh preference labels every round, against current policy | DPO with periodic relabel | Some open-source long-running post-training |
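The group-relative advantage that lets GRPO drop the value network is simple enough to sketch directly. The version below uses the population standard deviation and a small `eps` guard; both are assumptions of this sketch, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its own group.

    rewards: scores for G responses to the same prompt.
    Uses the population standard deviation; eps (an assumption) avoids
    division by zero when every response gets the same reward.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two responses per prompt")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four samples for one maths prompt, scored 1 if the answer checks out.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every sample scores the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct samples get a positive advantage and incorrect ones a negative advantage. A prompt where all samples agree contributes zero advantage, which is one reason larger group sizes are used with sparse rewards.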
Where it is used today: every aligned chat model
RLHF or one of its lineal successors is the post-training step that produced essentially every aligned chat model shipping in mid-2026. The mix varies by lab and by model, but the preference-data → alignment loop is universal.
Closed labs: GPT-4o (PPO RLHF with substantial RLAIF), o1 / o3 (RLVR + GRPO-style for reasoning, then RLHF polish), Claude Sonnet 4 / Opus 4 (constitutional AI + RLHF), Gemini 2.5 (PPO RLHF with internal variants). The exact recipes are proprietary but the public technical reports and model cards confirm the family of techniques.
Open weights: Llama 3 / 3.1 / 3.2 Instruct (SFT → DPO → iterative DPO, per Meta's technical report); Qwen 2 / 3 Instruct (SFT → DPO with KTO variants); Mistral Large 2 Instruct (DPO); Gemma 2 / 3 Instruct (SFT → DPO / ORPO); DeepSeek-V3 (SFT → DPO + GRPO for reasoning) and DeepSeek-R1 (heavy GRPO with RLVR). Phi-3 / 4 use SFT + DPO. Pure PPO RLHF in open releases is now rare — the operational cost and stability advantages of DPO are decisive.
Beyond chat: RLHF and successors are used wherever preference signal exists. Tool-use agents are increasingly RL-tuned against task-success rewards. Image generation models use RLHF-like fine-tuning on aesthetic preference labels. Code completion models use RLVR against unit-test pass rates. The pattern generalises whenever 'better' is easier to judge than to specify.
Yobitel customers running RLHF, DPO or GRPO post-training on NeoCloud typically book multi-week H100 or H200 reservations; the four-model PPO memory footprint and the rollout-throughput requirement drive the cluster shape more than the model size does. The aligned-model artefacts that come out of those runs deploy back to Yobibyte as private endpoints, alongside the upstream open-weights Instruct models the catalogue already publishes.
Trade-offs and known limitations
RLHF has a well-catalogued set of failure modes. Knowing them in advance is the difference between shipping an aligned model and shipping one that has been over-trained on a flawed reward signal.
Reward hacking is the headline risk. The reward model is an imperfect predictor of the latent human-preference distribution; once the policy can model the reward function precisely, it will find outputs that maximise predicted reward without maximising actual preference: repetitive phrases that the RM accidentally rewarded in training, particular formatting tricks, sycophantic agreement. Mitigations: KL penalty tuning, periodic reward-model retraining on fresh data, ensemble reward models, monitoring of trajectory KL divergence, and human spot-checks of the policy's outputs against the RM's scores.
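One way an ensemble of reward models can blunt reward hacking is to penalise disagreement between members. The sketch below is an illustrative aggregation rule under that assumption, not a description of any particular lab's setup; `conservative_ensemble_reward` and `disagreement_penalty` are hypothetical names.

```python
from statistics import mean, pstdev


def conservative_ensemble_reward(scores, disagreement_penalty=1.0):
    """Combine scores from several reward models into one cautious reward.

    scores: one scalar per ensemble member for the same (prompt, response).
    Subtracting the spread lowers the reward where members disagree, which
    is where a single RM is most likely to be exploitable.
    """
    if not scores:
        raise ValueError("need at least one reward-model score")
    return mean(scores) - disagreement_penalty * pstdev(scores)


agreed = conservative_ensemble_reward([2.0, 2.0, 2.0])  # members agree
contested = conservative_ensemble_reward([0.0, 4.0])    # same mean, high spread
```

Both examples have the same mean score, but the contested response is rewarded less, so the policy gains little from outputs that only one ensemble member likes.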
Mode collapse is the long tail of reward hacking. The policy concentrates probability mass on a small set of high-reward outputs and loses diversity. Across the dataset, every response starts with 'Certainly! I'd be happy to help you with that...' (or whatever the RM's local maximum is). Mitigations: entropy bonus in the PPO loss, varied prompt distribution during rollouts, periodic temperature scheduling to encourage exploration.
Sycophancy — the model tells the user what they want to hear rather than what is true (Perez et al. 2022) — emerges when labellers preferred agreeable responses in the preference data. Mitigations: deliberately include 'I don't know' and 'You are mistaken' in SFT and preference data; train against adversarial sycophancy probes (e.g. SycophancyEval); use constitutional principles that explicitly value honesty over agreement.
Length bias is a specific RM artefact: many RMs learn to prefer longer responses because thoroughness correlated with quality in their training data. Once the policy learns this, it generates unnecessarily long responses. Mitigations: length-normalised reward (subtract a length term during RM training), length-controlled evaluation that compares responses of similar length, and length-normalised DPO variants such as SimPO.
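A quick first audit for length bias is to correlate response length with reward-model score on a held-out set. The sketch below computes a plain Pearson correlation; the data is made up for illustration and `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Illustrative (made-up) audit data: token counts and RM scores.
lengths = [50, 120, 300, 450]
rm_scores = [0.1, 0.4, 0.9, 1.3]
length_score_r = pearson(lengths, rm_scores)
```

A high correlation is a prompt for closer inspection rather than proof of bias, since longer answers can be genuinely better. Comparing scores on pairs of similar quality but different length is a stronger follow-up check.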
Alignment tax is the capability cost of alignment: RLHF can degrade base-model capabilities (knowledge recall, reasoning) while improving alignment. It is documented in InstructGPT and visible in many open-model evals where the Instruct version underperforms the base on raw academic benchmarks. Mitigations: stronger SFT mixing, weighted multitask losses that combine RLHF with continued pretraining (InstructGPT's PPO-ptx), a larger KL coefficient β to keep the policy closer to the reference, and KL regularisation on a logprob-mixing reference.
Distributional fragility — RLHF only constrains behaviour on prompts similar to those in the preference dataset. On novel prompts (out-of-distribution requests, multilingual edges, very-long-context), behaviour may be unaligned. Mitigations: explicit coverage of edge cases in the preference dataset, red-teaming, automated jailbreak resistance training.
Operational cost — four-model PPO RLHF at frontier scale requires dedicated infrastructure: high-throughput inference (vLLM or SGLang) feeding the trainer, distributed PPO across multiple nodes, hundreds of GPUs running for weeks. A frontier RLHF run can consume tens of thousands of H100-days. This is the largest single reason DPO (two models, supervised loss) has displaced PPO in open-model post-training.
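The gap between four-model PPO and two-model DPO can be put in rough numbers with a back-of-envelope estimate of model-state memory. The sketch assumes all models have the same parameter count, bf16 weights for frozen models (2 bytes per parameter), and mixed-precision Adam for trainable models (about 16 bytes per parameter for weights, gradients, fp32 master weights and two optimiser moments). Activations, rollout KV cache and sharding overhead are ignored, so real footprints are larger.

```python
BYTES_FROZEN = 2      # bf16 weights only (assumption)
BYTES_TRAINABLE = 16  # bf16 weights + grads, fp32 master copy + Adam moments (assumption)

# Model counts per algorithm, following the text: PPO holds four models,
# GRPO drops the value network, DPO drops the reward model as well.
MODELS = {
    "ppo": {"trainable": 2, "frozen": 2},   # policy + value; reference + RM
    "grpo": {"trainable": 1, "frozen": 2},  # policy; reference + RM
    "dpo": {"trainable": 1, "frozen": 1},   # policy; reference
}


def state_memory_gb(algorithm, params_billions):
    """Approximate model-state memory in GB, excluding activations."""
    counts = MODELS[algorithm]
    bytes_per_param = (counts["trainable"] * BYTES_TRAINABLE
                       + counts["frozen"] * BYTES_FROZEN)
    # 1e9 parameters times bytes per parameter, divided by 1e9 bytes per GB.
    return params_billions * bytes_per_param


estimates = {name: state_memory_gb(name, 70) for name in MODELS}
```

Under these assumptions a 70B-parameter run needs roughly twice the model-state memory for PPO as for DPO. With a purely programmatic reward, GRPO can also drop the reward model, which this sketch does not account for.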
Practical implementation notes
Libraries that implement RLHF and successors in 2026: TRL (Hugging Face) is the canonical reference implementation, supporting SFT, RM training, PPO, DPO, KTO, ORPO and GRPO; OpenRLHF (OpenLLMAI) focuses on distributed PPO and GRPO at scale, with vLLM rollouts integrated; verl (Volcano Engine) is similar; LLaMA-Factory wraps many of these for end-to-end workflows; Axolotl supports SFT and DPO with the YAML config approach. For frontier-scale RL training, Megatron-LM + DeepSpeed-Chat or proprietary stacks (OpenAI, Anthropic, DeepMind) dominate.
Data is the headline operational issue. A frontier RLHF preference dataset typically contains 50,000-1,000,000 pairwise comparisons collected over months by carefully vetted labellers, with detailed annotation guidelines, multi-stage quality control, and per-labeller calibration. Cost ranges from $1-$5 per high-quality preference pair, putting frontier preference data collection at $50,000-$5 million per generation of a model. This cost is one of the most compelling reasons for RLAIF — AI-generated labels at $0.01-$0.10 per pair shift the economics by 10-100x, even after accounting for AI labeller compute.
Reward model quality determines the alignment ceiling. RMs should be trained on substantially the same distribution they will be evaluated on (don't train on Reddit and evaluate on customer-support prompts); they should be regularly audited against held-out human-preference test sets to track drift; and an ensemble of multiple RMs trained on subsets of the data is a standard variance-reduction trick. RewardBench (allenai/reward-bench) is the standard public benchmark for RM quality across categories.
PPO hyperparameters that matter: KL coefficient β (start 0.02), clip ratio 0.2, value loss coefficient 0.5, entropy bonus 0.01 (RLHF) or 0 (RLVR), generations per prompt 4-8 (PPO) or 8-64 (GRPO), batch size 256-1024 prompts, learning rate 1e-7 to 1e-6 (very small — the policy is already pretrained). PPO is sensitive to all of these, and the right values depend on RM quality, prompt distribution and model size; budget for hyperparameter sweeps.
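The starting values above can be captured in a small config object so that sweeps change one knob at a time. This is a hypothetical container, not the configuration class of TRL or any other library; the sweep values are an assumption drawn from the 0.01-0.1 range for β quoted earlier in this entry.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PPOStartingPoint:
    """Starting values quoted in the text; the class itself is hypothetical."""
    kl_coef: float = 0.02
    clip_ratio: float = 0.2
    value_loss_coef: float = 0.5
    entropy_bonus: float = 0.01      # the text suggests 0 for RLVR
    generations_per_prompt: int = 4  # text: 4-8 for PPO, 8-64 for GRPO
    batch_size_prompts: int = 256    # text: 256-1024 prompts
    learning_rate: float = 1e-6      # text: 1e-7 to 1e-6


def kl_sweep(base, kl_values=(0.01, 0.02, 0.05, 0.1)):
    """One config per KL coefficient, with everything else held fixed."""
    return [replace(base, kl_coef=beta) for beta in kl_values]


base = PPOStartingPoint()
sweep = kl_sweep(base)
```

Freezing the dataclass keeps each run's settings immutable, so a logged config always matches what was actually trained.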
Eval beyond rewards: never declare an RLHF run successful from reward curves alone. Run AlpacaEval 2.0, Arena-Hard, MT-Bench and your own task-specific evals on policy checkpoints throughout training. Run red-team probes (HarmBench, AdvBench) before shipping. Track raw-model capability on MMLU, GSM8K and HumanEval to detect alignment tax. A reward curve that rises while AlpacaEval falls is the classic 'you are reward hacking' diagnostic.
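The "reward up, eval down" diagnostic can be automated with a simple check over checkpoint logs. The sketch below flags any checkpoint where mean reward rose but an external eval score fell relative to the previous checkpoint; the tuple layout, the `min_drop` threshold and the numbers are illustrative assumptions.

```python
def flag_reward_hacking(checkpoints, min_drop=0.0):
    """Return the steps where reward rose while the external eval fell.

    checkpoints: list of (step, mean_reward, eval_score) in training order.
    min_drop   : smallest eval decrease that counts, to ignore eval noise.
    """
    flagged = []
    for previous, current in zip(checkpoints, checkpoints[1:]):
        _, prev_reward, prev_eval = previous
        step, reward, eval_score = current
        if reward > prev_reward and prev_eval - eval_score > min_drop:
            flagged.append(step)
    return flagged


# Made-up log: reward keeps climbing, the eval score peaks at step 200.
history = [(100, 0.8, 21.0), (200, 1.1, 23.5), (300, 1.6, 22.0), (400, 2.3, 18.5)]
suspect_steps = flag_reward_hacking(history)
```

In this toy log the checkpoints at steps 300 and 400 are flagged, which suggests the step-200 checkpoint is the better candidate to ship or to restart from.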
Model card and disclosure discipline: document the preference data sources, labeller demographics where ethically appropriate, the constitution (if RLAIF is used), the alignment failures discovered in red-teaming, and the residual behaviours after mitigation. The UK NCSC AI Cyber Security Code of Practice (2025) cites alignment-evaluation disclosure as a baseline expectation; the EU AI Act (whose obligations for general-purpose models began applying in August 2025, with enforcement powers following in August 2026) requires it for any model placed on the EU market.
If you are starting a new post-training pipeline in 2026 and your preference data is human-annotated pairs: start with DPO, not PPO. Same data, two models in memory instead of four, weeks of engineering saved on PPO stability. If you have verifiable rewards (math, code, agents): start with GRPO. Reach for full PPO RLHF only when you have specific reasons (research replication, mixed reward signals, a reward model that's hard to express as a closed-form preference).
Where RLHF fits in the Yobitel stack
Yobitel does not run a public-facing RLHF training service — alignment training is a research-grade workload that customers typically perform on their own pipelines. The Yobitel GPU Cloud provides the H100, H200 and B200 infrastructure that customer RLHF runs need; the InferenceBench catalogue documents serving performance for the resulting aligned models (Llama 3 Instruct, Qwen 3 Instruct, etc.) so teams can compare deployment options.
On the application side, Yobitel's first-party AI applications (MediQuery and the broader AI Applications Suite) consume aligned open-weights models that have already been RLHF-trained upstream. The platform does not modify their alignment; it adds compliance scaffolding (NCSC, GDPR, sector-specific) around the deployment.
References
- Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017) · arXiv
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT, Ouyang et al., 2022) · arXiv
- Learning to Summarize from Human Feedback (Stiennon et al., 2020) · arXiv
- Direct Preference Optimization (Rafailov et al., 2023) · arXiv
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) · arXiv
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning (GRPO, Shao et al., 2024) · arXiv
- Llama 3 Technical Report (post-training section) · arXiv
- TRL: Transformer Reinforcement Learning Library (Hugging Face) · Hugging Face