LLM-as-a-judge: evaluating LLM systems that actually scale
TL;DR
For generative output, human review doesn't scale and string-match metrics like BLEU and ROUGE don't capture quality, so I use a calibrated LLM judge scored against a rubric. The judge only earns trust once it agrees with a small human-labelled set (I report Cohen's kappa), and once its known biases — position, verbosity, self-preference — are actively mitigated. Then it goes into CI as a regression gate.
If you are shipping anything generative — a RAG assistant, a summariser, an agent — you cannot eyeball your way to quality, and BLEU/ROUGE/exact-match will lie to you. The practical answer is an LLM scoring outputs against an explicit rubric, but only after you have calibrated it against human labels and neutralised its biases. Skip the calibration step and you have built a confident random number generator.
Why don't human review and string metrics work?
Human review is the gold standard for quality and the worst standard for throughput. A reviewer manages maybe 50–100 careful judgements an hour, so a 2,000-prompt suite costs 20–40 reviewer-hours per run. Repeat that on every model bump and human-only review is dead. It is also inconsistent: the same reviewer disagrees with themselves across days, and two reviewers disagree with each other constantly.
String-match metrics scale fine but measure the wrong thing. BLEU and ROUGE reward n-gram overlap with a reference. For open-ended generation there are dozens of equally good answers that share almost no tokens with your reference, and plenty of token-overlapping answers that are wrong, contradictory or unsafe. Exact-match is even blunter — it only works when the output space is genuinely closed (a label, a number, a single SQL query you can execute and diff).
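To make the overlap problem concrete, here is a minimal sketch of a unigram-overlap F1, a simplified stand-in for ROUGE-1 rather than the official implementation. The three sentences are invented for illustration.

```python
from collections import Counter


def ngram_overlap_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 over shared n-grams: a simplified stand-in for ROUGE-N."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (assumption, not production data).
reference = "the refund is issued within 5 days"
wrong_but_similar = "the refund is not issued within 5 days"
right_but_reworded = "you get your money back in under a week"

score_wrong = ngram_overlap_f1(wrong_but_similar, reference)
score_right = ngram_overlap_f1(right_but_reworded, reference)
print(f"contradiction: {score_wrong:.2f}, paraphrase: {score_right:.2f}")
```

The contradicting answer shares nearly every token with the reference and scores high, while the acceptable paraphrase shares none and scores zero.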
So the field landed on a middle path: use a strong LLM as the evaluator. It scales like a metric and reasons about meaning like a human. The catch is that it is a noisy, biased instrument, and you have to treat it like one.
Pointwise scoring or pairwise comparison?
Two judging modes, different jobs.
Pointwise scores a single output in isolation against a rubric — "rate faithfulness to the source from 1 to 5". This is what you want for absolute quality gates and for tracking a metric over time. The weakness: LLMs are bad at stable absolute calibration. Ask for a 1–10 score and you get clustering around 7–8 and drift between runs.
Pairwise shows the judge two outputs (A vs B) and asks which is better. Models are far more reliable at relative comparison than absolute scoring, so pairwise is what you use to decide "is the new prompt better than the old one?". You aggregate many pairwise verdicts into a ranking — an Elo rating or a Bradley–Terry model — to rank prompt variants or candidate models.
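As a rough illustration of the aggregation step, the sketch below fits Bradley–Terry strengths from a list of (winner, loser) verdicts using the standard iterative update. The variant names and verdict counts are made up, and a real setup would likely add smoothing for variants that never win.

```python
from collections import defaultdict


def bradley_terry(verdicts, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns {name: strength} with strengths summing to 1.
    A variant with zero wins ends up with strength 0 in this simple version.
    """
    wins = defaultdict(int)
    games = defaultdict(int)  # unordered pair -> number of comparisons
    players = set()
    for winner, loser in verdicts:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 / len(players) for p in players}
    for _ in range(iterations):
        updated = {}
        for p in players:
            # Sum over opponents q of n_pq / (s_p + s_q).
            denom = sum(
                count / (strength[p] + strength[q])
                for pair, count in games.items() if p in pair
                for q in pair - {p}
            )
            updated[p] = wins[p] / denom if denom else strength[p]
        total = sum(updated.values())
        strength = {p: s / total for p, s in updated.items()}
    return strength


# Hypothetical verdicts: each pair of prompt variants compared 10 times.
verdicts = (
    [("prompt_v2", "prompt_v1")] * 7 + [("prompt_v1", "prompt_v2")] * 3
    + [("prompt_v2", "prompt_v3")] * 8 + [("prompt_v3", "prompt_v2")] * 2
    + [("prompt_v1", "prompt_v3")] * 6 + [("prompt_v3", "prompt_v1")] * 4
)
strengths = bradley_terry(verdicts)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```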
My rule of thumb: pairwise for ranking and A/B decisions, pointwise for absolute regression thresholds. Often I run both — pairwise to choose between candidates, pointwise to assert the winner still clears a minimum bar.
How do I build the golden set and rubric?
The golden set is the asset. The judge is just a lens over it.
Aim for 100–300 examples that actually represent production: the easy cases, the long-tail edge cases, the adversarial ones, and the categories you know matter to the business. Stratify it — don't let 80% of the set be one easy intent. Every example carries the input, ideally a reference answer, and the dimensions you care about.
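One way to keep the set honest is to store a category on every example and check the distribution. This is a sketch only: the `GoldenExample` fields and the 0.5 share limit are assumptions, not a fixed schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class GoldenExample:
    input: str
    category: str                    # e.g. intent or business area (hypothetical field)
    reference: Optional[str] = None  # known-good answer, when one exists


def overweight_categories(examples, max_share=0.5):
    """Return {category: share} for categories above max_share of the set."""
    counts = Counter(ex.category for ex in examples)
    total = len(examples)
    return {c: n / total for c, n in counts.items() if n / total > max_share}


# Toy set where one easy intent dominates.
golden = [GoldenExample("q", "faq")] * 8 + [GoldenExample("q", "refund")] * 2
print(overweight_categories(golden))
```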
The rubric must be specific enough that two humans applying it agree. "Rate quality 1–5" is useless. Break quality into named dimensions with explicit anchors:
Dimension: Faithfulness (is every claim supported by the provided context?)
5 - Every claim is directly supported by the context. No additions.
3 - Mostly supported; one minor unsupported but plausible detail.
1 - Contains a claim that contradicts or is absent from the context.
Dimension: Completeness (does it answer what was actually asked?)
5 - Fully addresses every part of the question.
3 - Addresses the main ask, misses a secondary part.
1 - Misses the central question.
A G-Eval-style approach helps here: hand the judge the rubric, make it produce a short chain-of-thought against each criterion before emitting the score, and have it return structured JSON. The reasoning step measurably improves agreement with humans, and it gives you an audit trail when a verdict looks wrong.
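A minimal sketch of that pattern follows: build a prompt from the rubric that asks for reasoning before each score, then validate the JSON that comes back. The prompt wording, the JSON shape and the sample response are assumptions for illustration, and the actual model call is deliberately left out.

```python
import json

# Criterion descriptions taken from the rubric above.
RUBRIC = {
    "faithfulness": "Is every claim supported by the provided context?",
    "completeness": "Does it answer what was actually asked?",
}


def build_judge_prompt(question, context, answer, rubric=RUBRIC):
    """Assemble a rubric-driven prompt that asks for reasoning, then a score."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are grading an answer against a rubric.\n"
        f"Criteria (score each 1-5):\n{criteria}\n\n"
        f"Question:\n{question}\n\nContext:\n{context}\n\nAnswer:\n{answer}\n\n"
        "For each criterion, write brief reasoning first, then the score.\n"
        'Return only JSON shaped like '
        '{"faithfulness": {"reasoning": "...", "score": 3}, ...}'
    )


def parse_verdict(raw, rubric=RUBRIC):
    """Parse the judge's JSON and check every score is an integer in 1..5."""
    data = json.loads(raw)
    scores = {}
    for name in rubric:
        score = data[name]["score"]
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {name}: {score!r}")
        scores[name] = score
    return scores


# Hypothetical judge response, hard-coded instead of calling a model.
sample_response = json.dumps({
    "faithfulness": {"reasoning": "All claims appear in the context.", "score": 5},
    "completeness": {"reasoning": "Skips the second part of the question.", "score": 3},
})
print(parse_verdict(sample_response))
```

Keeping the reasoning field in the stored verdict is what provides the audit trail; the parser here only extracts the scores.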
What about the judge's biases?
LLM judges have well-documented, reproducible biases. If you don't mitigate them, your numbers are noise dressed as signal.
- Position bias. In pairwise mode, judges systematically favour whichever answer is shown first (sometimes second — it varies by model). Mitigation: run every pair twice with the order swapped and only count it as a win if the judge agrees both ways. Ties and order-flips get flagged, not silently scored.
- Verbosity bias. Judges reward longer, more elaborate answers even when the extra length adds nothing or hallucinates. Mitigation: put concision in the rubric explicitly, and watch for correlation between length and score on your golden set — if score tracks length, your judge is measuring word count.
- Self-preference bias. A model tends to rate outputs from its own family higher. If you generate with GPT-4-class and judge with the same family, you've baked in a thumb on the scale. Mitigation: judge with a different model family than the one under test, or use a panel.
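The first two mitigations are easy to mechanise. The sketch below assumes a hypothetical `judge(prompt, first, second)` callable that returns "first", "second" or "tie"; it runs each pair in both orders and flags anything inconsistent, and it adds a plain Pearson correlation between output length and score as a verbosity check.

```python
import math


def debiased_pairwise(judge, prompt, answer_a, answer_b):
    """Run the pair in both orders; count a win only if both orders agree.

    `judge` is a hypothetical callable returning "first", "second" or "tie".
    Returns "A", "B", or "flag" for ties and order-dependent verdicts.
    """
    forward = judge(prompt, answer_a, answer_b)
    backward = judge(prompt, answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "flag"


def length_score_correlation(outputs, scores):
    """Pearson correlation between output length (in words) and judge score."""
    lengths = [len(o.split()) for o in outputs]
    n = len(lengths)
    mean_l, mean_s = sum(lengths) / n, sum(scores) / n
    cov = sum((l - mean_l) * (s - mean_s) for l, s in zip(lengths, scores))
    var_l = sum((l - mean_l) ** 2 for l in lengths)
    var_s = sum((s - mean_s) ** 2 for s in scores)
    if var_l == 0 or var_s == 0:
        return 0.0  # no variation, correlation undefined; treat as none
    return cov / math.sqrt(var_l * var_s)
```

A judge that always answers "first" comes back as "flag" on every pair, which is the point: position-driven verdicts never get counted as wins.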
Two structural mitigations stack on top:
Reference-guided grading — give the judge a known-good reference answer to grade against rather than judging in a vacuum. This anchors it and sharply cuts variance.
Panel of judges — use two or three different judge models and aggregate (majority vote, or average score). A panel reduces any single model's idiosyncratic bias and gives you a disagreement signal: when the panel splits, that example is genuinely hard and worth a human glance.
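A panel can be aggregated with a plain majority vote that also reports whether the judges split. This is a sketch; the judge names are placeholders and the pass/fail labels are an assumption about the verdict format.

```python
from collections import Counter


def panel_verdict(votes):
    """Aggregate {judge_name: label} by majority vote.

    Returns (label, unanimous). label is None when there is no strict
    majority winner, which is itself a signal to route to a human.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes.values())
    (top, top_n), *rest = counts.most_common()
    if rest and rest[0][1] == top_n:
        return None, False  # tie between the leading labels
    return top, len(counts) == 1


# Placeholder judge names (hypothetical).
print(panel_verdict({"judge_a": "pass", "judge_b": "pass", "judge_c": "fail"}))
```

Examples where `unanimous` is false are the ones worth sending to human review.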
How do I know the judge is any good?
You calibrate it against humans, and you report agreement as a number. This is the step everyone skips and it is the only thing standing between you and self-deception.
Take 50–100 examples from the golden set, have a human label them carefully, then run your judge on the same set and measure how well they agree. For categorical or pass/fail verdicts, use Cohen's kappa (which corrects for agreement by chance); for ordinal scores, use a rank correlation such as Spearman's rho or Kendall's tau.
from sklearn.metrics import cohen_kappa_score
# human and judge: pass/fail labels on the same 80 calibration examples
kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Cohen's kappa: {kappa:.2f}")
# kappa > 0.6 = substantial agreement, usable as a gate
# 0.4-0.6 = moderate; tighten the rubric before trusting it
# < 0.4 = the judge is not measuring what you think; fix the rubric
A kappa around 0.6 or above is typically the bar for using the judge as a gate. Below that, the fix is almost always the rubric, not the model: vague anchors produce disagreement. Re-run calibration whenever you change the judge model, the judge prompt, or the model family under test. The judge is not a fixed instrument; it drifts when any of those change.
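To see why the chance correction matters, it helps to compute kappa by hand: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected if both raters labelled independently at their own base rates. The sketch below uses invented labels and is meant as an illustration, not a replacement for the library call.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: the judge passes everything, humans fail 2 of 10.
human = ["pass"] * 8 + ["fail"] * 2
lazy_judge = ["pass"] * 10

raw_agreement = sum(h == j for h, j in zip(human, lazy_judge)) / len(human)
kappa_lazy = cohen_kappa(human, lazy_judge)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa_lazy:.2f}")
```

A judge that passes everything agrees with the humans 80% of the time here, yet its kappa is zero, because that level of agreement is exactly what its base rate would produce by chance.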
How does this go into CI?
This is where it pays for itself. Once the judge is calibrated, wire it in as a regression gate on every prompt or model change.
The flow: a change to a prompt template or a model version triggers a job that runs the candidate over the golden set, scores each output with the calibrated judge, and compares aggregate scores (and per-dimension scores) against the last known-good baseline. If mean faithfulness drops below threshold, or any high-severity category regresses, the pipeline fails the change.
# eval gate, runs on every PR touching prompts/ or model config
- run: python eval/run_judge.py --suite golden_set.jsonl --judge claude-haiku
- run: |
    python eval/check_regression.py \
      --baseline baselines/main.json \
      --min-faithfulness 4.2 \
      --max-category-drop 0.3  # fail if any category falls >0.3 vs baseline
Two things make this survivable in practice. Report per-dimension and per-category breakdowns, not one blended number, because a blended score hides a faithfulness collapse behind a fluency gain. And treat threshold changes as deliberate decisions in the PR, so nobody quietly lowers the bar to make a red build green.
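The comparison logic behind such a gate might look like the sketch below. The internals of `check_regression.py` are not shown in this article, so the result layout (a mean faithfulness plus per-category means) is an assumption; only the two thresholds mirror the flags above.

```python
def check_regression(baseline, candidate,
                     min_faithfulness=4.2, max_category_drop=0.3):
    """Compare a candidate run against a baseline; return failure messages.

    Assumed layout for both arguments:
    {"faithfulness": mean_score, "by_category": {category: mean_score}}
    """
    failures = []
    if candidate["faithfulness"] < min_faithfulness:
        failures.append(
            f"mean faithfulness {candidate['faithfulness']:.2f} "
            f"is below {min_faithfulness}"
        )
    for category, base_score in baseline["by_category"].items():
        cand_score = candidate["by_category"].get(category)
        if cand_score is None:
            failures.append(f"category {category!r} missing from candidate run")
        elif base_score - cand_score > max_category_drop:
            failures.append(
                f"category {category!r} dropped {base_score - cand_score:.2f}"
            )
    return failures


# Invented numbers: the blended mean passes, one category regresses.
baseline = {"faithfulness": 4.4, "by_category": {"billing": 4.5, "refunds": 4.4}}
candidate = {"faithfulness": 4.3, "by_category": {"billing": 4.6, "refunds": 3.9}}
failures = check_regression(baseline, candidate)
print(failures)
```

A CI wrapper would exit non-zero when the list is non-empty. In this example the overall mean clears the bar while the refunds category fails, which is the case a single blended number would hide.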
What does it cost, and when is it worth it?
The honest objection: judging is itself LLM inference, so a 2,000-example suite run on every PR is real money and real latency. Controls that work:
- Cheaper judge model. You rarely need the frontier model to judge. A mid-tier or small model, given a tight rubric and reference answers, often calibrates well enough. Test it on the calibration set — if its kappa holds, use it.
- Sampling. Run the full suite nightly and on release; run a stratified 200-example subset on every PR.
- Caching. Cache judge verdicts keyed by (input, output, rubric version). If the candidate output for an example is byte-identical to a previous run, you don't re-judge it. On a stable suite the cache hit rate is high.
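The sampling and caching controls can be sketched in a few lines. Everything here is illustrative: the dict-based cache, the `category` field and the `judge_fn` callable are assumptions, and a real pipeline would persist the cache between runs.

```python
import hashlib
import json
import random
from collections import defaultdict


def verdict_key(input_text, output_text, rubric_version):
    """Stable cache key for (input, output, rubric version)."""
    payload = json.dumps([input_text, output_text, rubric_version],
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def judge_with_cache(cache, input_text, output_text, rubric_version, judge_fn):
    """Call judge_fn only when this exact triple has not been judged before."""
    key = verdict_key(input_text, output_text, rubric_version)
    if key not in cache:
        cache[key] = judge_fn(input_text, output_text)
    return cache[key]


def stratified_subset(examples, size, seed=0):
    """Sample about `size` examples, proportionally per category.

    Rounding and the one-per-category minimum mean the result can differ
    slightly from `size`.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for example in examples:
        by_category[example["category"]].append(example)
    subset = []
    for _, items in sorted(by_category.items()):
        k = max(1, round(size * len(items) / len(examples)))
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset
```

Bumping the rubric version changes every key, so stale verdicts are never reused after a rubric edit.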
When is the whole apparatus worth building? If you ship generative output to users and change prompts or models more than occasionally, yes — the calibration and CI work pays back the first time it catches a silent regression before a customer does. When is it overkill? If your output space is closed (classification, extraction, runnable code/SQL you can verify deterministically), use the deterministic check — it is cheaper, faster and unbiased, and an LLM judge adds nothing but cost. And never let the judge be the only signal: keep a small human review loop running on the panel's disagreements, because that is where your next rubric improvement is hiding.