TL;DR
- LLM-as-judge is the practice of using a strong LLM to score, rank, or pairwise-compare outputs of another LLM, with a rubric defined in natural language.
- Formalised and characterised in "Judging LLM-as-a-Judge" (Zheng et al., 2023, arXiv:2306.05685), which introduced MT-Bench and Chatbot Arena and reported >80% agreement between strong judges (GPT-4 at the time) and human evaluators.
- Known biases include position bias (preferring the first or second of two options), verbosity bias (preferring longer answers), self-enhancement bias (judges favouring their own family), and limited-reasoning bias (errors on maths and logic).
- Mitigations are well-established: randomise position, use chain-of-thought judging, calibrate with reference answers, prefer pairwise comparison over absolute scoring for open-ended tasks, and triangulate with programmatic and human signals.
Why It Works (and When)
Exact-match grading does not scale to free-form text. BLEU, ROUGE, and BERTScore are coarse. Human evaluation is the gold standard but slow and expensive. LLM-as-judge sits between them: strong frontier models, given a clear rubric, track human judgements substantially more closely than traditional automatic metrics do. The original MT-Bench paper reported over 80% agreement between GPT-4 and human evaluators, comparable to the agreement between two human annotators.
It works best for subjective dimensions where humans agree (clarity, relevance, helpfulness) and worst for tasks the judge model itself cannot do (advanced maths, niche domain expertise, complex code review). Calibrate before deploying.
The Three Judge Modes
- Single-answer scoring — judge sees one output and assigns a score on a defined scale. Simple but noisy; absolute scores drift between judges and prompts.
- Pairwise comparison — judge sees two outputs and picks the better one (or a tie). Lower variance than absolute scoring, and the standard for preference ranking.
- Reference-guided — judge sees the candidate output and a known-good reference; scores or compares relative to it. The most reliable mode when references are available.
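The three modes differ mainly in what the judge is shown. A minimal sketch of one prompt builder per mode follows; the function names and prompt wording are illustrative assumptions, not templates taken from the MT-Bench paper.

```python
# Hypothetical prompt builders, one per judge mode. Wording is an assumption.

def single_answer_prompt(question: str, answer: str, scale: str = "1-5") -> str:
    """Single-answer scoring: the judge sees one output and returns a score."""
    return (
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        f"Rate the answer on a {scale} scale. Reply with the score only."
    )


def pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise comparison: the judge sees two outputs and picks one or a tie."""
    return (
        f"Question:\n{question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Which answer is better? Reply with exactly one of: A, B, tie."
    )


def reference_guided_prompt(
    question: str, reference: str, answer: str, scale: str = "1-5"
) -> str:
    """Reference-guided: the judge scores the candidate against a known-good answer."""
    return (
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Candidate answer:\n{answer}\n\n"
        f"Rate the candidate against the reference on a {scale} scale. "
        "Reply with the score only."
    )
```

Keeping the builders separate makes it easy to version each prompt independently and to run the same calibration set through more than one mode.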
Known Biases
| Bias | Description | Mitigation |
|---|---|---|
| Position bias | Prefers option in position A (or B) | Randomise order; judge both orderings and count only consistent verdicts |
| Verbosity bias | Prefers longer / more detailed answers | Constrain rubric to information content |
| Self-enhancement | Judge favours outputs from same model family | Use a different family for judging |
| Limited reasoning | Misjudges maths / logic problems it cannot solve itself | Use programmatic check for verifiable tasks |
| Style bias | Prefers Markdown, bullet points, hedged tone | Specify desired style explicitly in rubric |
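One common way to handle position bias is to call the judge twice with the answers swapped and to accept a win only when the same answer is preferred in both orders. The sketch below assumes a hypothetical judge callable that returns "A", "B", or "tie" for the answer shown first or second; the two stub judges exist only to demonstrate the behaviour.

```python
from typing import Callable

# Assumed judge signature: (question, first_answer, second_answer) -> "A" | "B" | "tie",
# where "A" means the answer shown first was preferred.
Judge = Callable[[str, str, str], str]


def compare_both_orders(judge: Judge, question: str, ans_1: str, ans_2: str) -> str:
    """Return "1", "2", or "tie"; a win needs the same answer preferred in both orders."""
    first = judge(question, ans_1, ans_2)   # ans_1 shown in position A
    second = judge(question, ans_2, ans_1)  # ans_1 shown in position B
    if first == "A" and second == "B":
        return "1"
    if first == "B" and second == "A":
        return "2"
    return "tie"  # tied or inconsistent verdicts are treated as a tie


# Stub judges for illustration only (not real model calls).
def always_first(question: str, a: str, b: str) -> str:
    """A maximally position-biased judge: always prefers position A."""
    return "A"


def prefers_longer(question: str, a: str, b: str) -> str:
    """A position-independent (but verbosity-biased) judge."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"
```

With `always_first`, every comparison collapses to a tie, which is the intended outcome: a verdict that flips with position carries no signal. Note that the swap does nothing for verbosity bias, as `prefers_longer` shows.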
Writing a Good Judge Rubric
Rubric quality determines judge quality. A weak rubric ("Is the answer good?") produces noisy scores; a strong rubric (a numbered list of criteria with examples of pass and fail) produces reliable scores. Treat rubric authoring as prompt engineering — version-controlled, reviewed, and tested against a small calibration set with known human-graded outcomes.
You are a strict evaluator. Score the answer 1-5 on accuracy.
Criteria:
- 5: All claims directly supported by the context. No fabrication.
- 4: Mostly supported; one minor unsupported claim.
- 3: Mixed — some supported, some unsupported.
- 2: Mostly unsupported; ignores context.
- 1: Fabricated or contradicts context.
Think step by step before scoring. List each claim and check it.
Output: <reasoning>...</reasoning><score>N</score>
Calibration
Before trusting a judge in CI, calibrate it. Build a small (50-200 example) set with hand-graded human scores. Run the judge over the same set. Compute agreement (Cohen's kappa or simple correlation). If agreement is poor, iterate on the rubric or change the judge model. Recalibrate when you change either.
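The calibration loop can be sketched in a few lines: extract the score from each judge reply, then compare against the human grades. The example below assumes the `<score>N</score>` output format and 1-5 scale from the rubric above; the function names are illustrative, and Cohen's kappa is computed in its plain unweighted form.

```python
import re
from collections import Counter
from typing import List, Optional

# Matches the <score>N</score> tag requested by the rubric (1-5 scale assumed).
SCORE_RE = re.compile(r"<score>\s*([1-5])\s*</score>")


def parse_score(judge_output: str) -> Optional[int]:
    """Return the integer score, or None if the reply has no valid score tag."""
    match = SCORE_RE.search(judge_output)
    return int(match.group(1)) if match else None


def raw_agreement(human: List[int], judge: List[int]) -> float:
    """Fraction of examples where judge and human gave the same score."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: List[int], judge: List[int]) -> float:
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    n = len(human)
    p_observed = raw_agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    p_expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Unweighted kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement. For ordinal scales a weighted variant may be more informative; scikit-learn's `cohen_kappa_score` accepts a `weights` argument for that purpose.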
Calibration is also how you defend the metric in a regulated environment. "Our judge agrees with our reviewers 87% of the time on a 200-example calibration set, last reviewed YYYY-MM-DD" is a far more defensible claim than "GPT-4 says it's a 4.2".
Never use the same model both to generate and judge in a high-stakes evaluation. Self-enhancement bias is real and well-documented. Use a different family — Anthropic-judge for OpenAI outputs, OpenAI-judge for Anthropic outputs, or use ensembled judges across families.
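A cross-family ensemble can be sketched as follows. The family labels, the `ScoreFn` signature, and the choice of the median as the aggregate are all assumptions made for illustration; the judges here are stand-ins for real model calls.

```python
from statistics import median
from typing import Callable, Mapping

# Assumed judge signature: (question, answer) -> integer score.
ScoreFn = Callable[[str, str], int]


def cross_family_score(
    judges: Mapping[str, ScoreFn],
    generator_family: str,
    question: str,
    answer: str,
) -> float:
    """Median score over judges whose family differs from the generator's."""
    eligible = [fn for family, fn in judges.items() if family != generator_family]
    if not eligible:
        raise ValueError("no judge from a different family is available")
    return median([fn(question, answer) for fn in eligible])
```

The median is used here because it is less sensitive to a single outlying judge than the mean; a majority vote plays the same role for pairwise verdicts.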
Where Judges Fail
LLM-as-judge fails predictably on tasks the judge cannot itself perform — advanced maths, formal proofs, niche specialist knowledge, code that depends on private context. For these, prefer programmatic checks: unit tests, formal verifiers, type checkers, deterministic graders. Use the judge for the dimensions a judge handles well (clarity, presence of required elements, faithfulness to context) and use programmatic tools for the dimensions it does not.
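The split between programmatic checks and judge scoring can be expressed as a simple router. This is a minimal sketch under stated assumptions: the `Task` shape is hypothetical, exact string match stands in for whatever deterministic grader applies (unit tests, a verifier, a type checker), and `judge` is any callable that returns a score.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Task:
    prompt: str
    answer: str
    # Set only when the answer can be verified mechanically (assumption).
    expected: Optional[str] = None


def exact_match(task: Task) -> float:
    """Deterministic grader: 1.0 on a whitespace-insensitive exact match, else 0.0."""
    return 1.0 if task.answer.strip() == (task.expected or "").strip() else 0.0


def grade(task: Task, judge: Callable[[Task], float]) -> Tuple[str, float]:
    """Route verifiable tasks to the programmatic check and the rest to the judge."""
    if task.expected is not None:
        return ("programmatic", exact_match(task))
    return ("judge", judge(task))
```

Returning the grader name alongside the score keeps the two signal types separable in reports, so judge scores are never silently averaged with pass/fail results.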
References
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · arXiv (Zheng et al., 2023)
- Ragas Documentation — Metrics · Ragas
- DeepEval Documentation — G-Eval · Confident AI