TL;DR
- Constitutional AI (Bai et al., 2022, arXiv:2212.08073) trains a model to be helpful, honest and harmless using a written set of principles ('the constitution'). An AI model applies those principles to generate the harmlessness feedback, replacing direct human safety labels; in the original paper, helpfulness labels still came from humans.
- Two stages: SL-CAI (Supervised Learning) — the model critiques and revises its own outputs against the constitution to produce a fine-tuning dataset. RL-CAI — preference labels are generated by an AI against the constitution and used for RLHF-style training.
- Cuts the human safety-labelling burden, makes alignment behaviour explicitly governed by a written document, and scales to behaviours where human labelling is expensive or harmful.
- Anthropic's Claude models are trained with Constitutional AI as a central component; AI-generated feedback and synthetic data have since been adopted, in modified form, by many open-weight post-training pipelines (Llama 3, for example, reports heavy use of synthetic data in post-training).
The Problem CAI Addresses
Standard RLHF relies on human labellers to compare model responses on safety-relevant prompts. This is slow, expensive, and exposes labellers to potentially harmful content. It is also opaque: the model's safety behaviour emerges from the aggregate of labeller judgements, with no explicit articulation of the underlying principles.
Constitutional AI flips the script. Articulate the principles explicitly as a written constitution (e.g. 'Choose the response that is most helpful, harmless, and honest'). Use an existing model to critique candidate responses against the constitution and choose the better one. Train on that synthetic feedback.
The Two-Stage Pipeline
Stage 1 — Supervised Learning (SL-CAI). The model critiques and revises its own outputs against the constitution, producing a fine-tuning dataset of safer responses.
Stage 2 — Reinforcement Learning (RL-CAI). AI-generated preference labels (under the constitution) drive an RLHF-style update of the policy.
The SL-CAI stage proceeds as follows:
- Start with a helpful-but-not-yet-safe model.
- Prompt it with potentially harmful queries; it produces a candidate response.
- Prompt the same model to critique its response against a randomly sampled constitutional principle, then revise.
- Repeat the critique-and-revise loop a few times.
- Fine-tune the model on the final revised responses (SL-CAI complete).
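The SL-CAI steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `generate` stands in for any text-completion call, and the prompt templates, the `PRINCIPLES` list, and the function names are all hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical critique principles, paraphrased for illustration only.
PRINCIPLES = [
    "Identify ways in which the response is harmful, unethical, or dangerous.",
    "Identify ways in which the response could encourage illegal activity.",
]


def critique_and_revise(
    generate: Callable[[str], str],
    prompt: str,
    principles: List[str],
    n_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the final revision after n_rounds of critique and revision."""
    rng = rng or random.Random()
    response = generate(prompt)  # initial, possibly harmful, draft
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # one randomly sampled principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\n"
            "Revision:"
        )
    return response


def build_sl_cai_dataset(
    generate: Callable[[str], str],
    prompts: List[str],
    principles: List[str],
    n_rounds: int = 2,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Pair each prompt with its final revision, ready for supervised fine-tuning."""
    rng = random.Random(seed)
    return [
        (p, critique_and_revise(generate, p, principles, n_rounds, rng))
        for p in prompts
    ]
```

Only the final revision is kept as the fine-tuning target; the intermediate critiques are scaffolding and are discarded in this sketch.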
RL-CAI in Detail
- Sample pairs of responses from the SL-CAI model on a safety-relevant prompt.
- Prompt a feedback model (an AI model, not a human labeller) to choose the better response under the constitution, producing a synthetic preference label.
- Train a preference model on these labels (same form as RLHF's reward model).
- Fine-tune the policy against the preference model with RL; the original paper used PPO. Later pipelines sometimes use DPO instead, which optimises directly on the preference pairs and skips the separate preference model.
The output is a model whose safety behaviour was shaped by AI-generated preferences derived from an explicit, auditable document, not by aggregated human judgements.
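To make the labelling and preference-model steps concrete, here is a small sketch in plain Python. It assumes the feedback model exposes log-probabilities for answering "(A)" or "(B)", normalises them into a soft label (the paper describes using normalised probabilities as targets), and scores a preference model with a Bradley-Terry style cross-entropy. The function names are hypothetical, and a real pipeline would compute this loss over batches in a deep-learning framework.

```python
import math


def soft_label(logp_a: float, logp_b: float) -> float:
    """Probability that response A is preferred, obtained by normalising the
    feedback model's log-probabilities for the two answer options."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)


def preference_loss(score_a: float, score_b: float, target_a: float) -> float:
    """Cross-entropy between the soft target and sigmoid(score_a - score_b),
    where score_a and score_b are the preference model's scalar scores."""
    diff = score_a - score_b
    # log(sigmoid(diff)), computed without overflow for large |diff|
    if diff >= 0:
        log_p_a = -math.log1p(math.exp(-diff))
    else:
        log_p_a = diff - math.log1p(math.exp(diff))
    log_p_b = log_p_a - diff  # log(sigmoid(-diff))
    return -(target_a * log_p_a + (1.0 - target_a) * log_p_b)
```

For example, `soft_label(math.log(3), math.log(1))` gives 0.75: the feedback model found option A three times as likely as option B. The preference model is then trained to reduce `preference_loss` on such targets, and its scores serve as the reward during RL.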
The Constitution
Anthropic's published constitution draws on the UN Universal Declaration of Human Rights, Apple's terms of service, and principles intended to capture non-Western perspectives, totalling several dozen rules. Examples:
- 'Please choose the response that is as harmless and ethical as possible.'
- 'Compare the degree of harmfulness in the assistant responses and choose the one less harmful.'
- 'Choose the response that is more supportive of life, liberty, and personal security.'
During training, a principle is randomly sampled for each critique-and-revise step and for each preference comparison, so no single principle dominates and behaviour generalises across the rule set.
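As an illustration of how comparison principles like these might be used, the sketch below samples one principle at random and wraps it in a multiple-choice prompt for the feedback model. The template is loosely modelled on the format described in Bai et al. (2022); the exact wording, and the name `build_comparison_prompt`, are assumptions made for this example.

```python
import random
from typing import List, Optional


def build_comparison_prompt(
    conversation: str,
    response_a: str,
    response_b: str,
    principles: List[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Build a multiple-choice prompt asking which response better satisfies
    one randomly sampled constitutional principle."""
    rng = rng or random.Random()
    principle = rng.choice(principles)  # a fresh principle for every comparison
    return (
        "Consider the following conversation between a human and an assistant:\n"
        f"{conversation}\n"
        f"{principle}\n"
        "Options:\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "The answer is:"
    )
```

The feedback model's relative probability of continuing with "(A)" or "(B)" then becomes the preference label for that pair.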
Why It Matters
- Auditability — the rules governing the model's behaviour are written down and inspectable, not buried in a labeller pool.
- Scalability — AI labellers cost a fraction of human ones; large preference datasets become tractable.
- Iteration: changing the behaviour means editing the constitution and regenerating the AI labels, whereas changing human labelling guidelines means a fresh round of human re-labelling.
- Reduction of harm to labellers — humans no longer need to read every harmful prompt at scale.
- Generalisation: an AI labeller tends to apply a given principle more uniformly than a pool of human labellers, whose judgements vary from person to person.
CAI is not a replacement for human judgement — the constitution itself is written by humans, and the AI labellers' alignment depends on their own training. CAI shifts where humans are needed, from per-example labelling to constitution drafting and AI-labeller auditing.
RLAIF and the Broader Trend
Constitutional AI is a specific instance of a broader pattern called RLAIF (Reinforcement Learning from AI Feedback): use AI models to generate the preference labels that traditionally came from humans. DeepMind's Sparrow (2022) took a related rule-based approach, although its rule judgements came from human raters, and Lee et al. (2023) at Google compared RLAIF directly against RLHF.
By 2024, AI-generated feedback and synthetic data had become a common component of frontier models' post-training. Llama 3 reports substantial use of synthetic data, and the Qwen and DeepSeek reports describe similar practices. Human labelling is increasingly concentrated on high-quality calibration sets and red-teaming.
Limits and Criticism
The constitution can encode the values of its drafters in unexamined ways. Collective Constitutional AI (Anthropic, 2023) explored democratically sourced constitutions to mitigate this. Critics also note that CAI may produce models that articulate principles well but fail to apply them under adversarial pressure — a behavioural-versus-cognitive alignment gap.
Anthropic has continued to refine the approach: Claude's later models (Claude 3, Claude 4) use evolved versions of CAI combined with other techniques (red-teaming, automated jailbreak resistance training). The exact recipes are proprietary, but the constitutional framing remains central to Anthropic's public alignment narrative.
References
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) · arXiv
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022) · arXiv
- Collective Constitutional AI (Anthropic, 2023) · Anthropic
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (Lee et al., 2023) · arXiv