Calibration Curves of LLM-as-Judge Across Model Sizes
1. Introduction
"LLM-as-judge" — using an LLM to score the output of another LLM — has become standard for open-ended evaluation [Zheng et al. 2023]. Most analyses focus on agreement with human raters; calibration of the judge's probability scores has received much less attention. A judge that says "score 0.9" is useful only insofar as outputs it scores 0.9 are correct roughly 90% of the time.
This paper is an empirical calibration study, not a method paper. Our contribution is data and a recalibration recipe.
2. Setup
2.1 Judges
We used nine judges spanning nearly three orders of magnitude in parameter count (1.3B to ~600B): a small 1.3B open model, four mid-sized 7B-34B models, three frontier 70B+ models, and one 600B-class API model.
2.2 Tasks
- Summarization faithfulness (8,200 items)
- Code-explanation correctness (4,300 items)
- Argument-quality scoring (5,500 items)
Ground truth was a 3-rater majority among domain experts, with disputes resolved by a fourth rater (Cohen's κ = 0.81 between expert raters before adjudication).
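For reference, pairwise inter-rater agreement of the kind reported above can be checked with scikit-learn's `cohen_kappa_score`; the rater arrays below are placeholders, not the released annotations.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Placeholder 0/1 correctness labels from the three expert raters (illustrative only).
rater_a = np.array([1, 0, 1, 1, 0, 1])
rater_b = np.array([1, 0, 1, 0, 0, 1])
rater_c = np.array([1, 1, 1, 1, 0, 1])

pairwise = [cohen_kappa_score(x, y) for x, y in combinations([rater_a, rater_b, rater_c], 2)]
print(f"mean pairwise Cohen's kappa: {np.mean(pairwise):.2f}")
```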
2.3 Calibration metric
We used Expected Calibration Error (ECE) over equal-mass bins [Guo et al. 2017]:

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|I_b|}{N}\,\bigl|\mathrm{acc}(I_b) - \mathrm{conf}(I_b)\bigr|$$

where the $N$ judge decisions are sorted by confidence and partitioned into $B$ equal-mass bins $I_b$, and $\mathrm{acc}(I_b)$ and $\mathrm{conf}(I_b)$ are the empirical accuracy and mean confidence within bin $b$. We also report Brier scores.
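For concreteness, a minimal equal-mass-bin ECE implementation is sketched below; the bin count and variable names are illustrative rather than the exact evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-mass bins: bin-size-weighted |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    ece = 0.0
    for idx in np.array_split(order, n_bins):  # equal-mass bins over sorted confidences
        if len(idx) == 0:
            continue
        gap = abs(correct[idx].mean() - confidences[idx].mean())
        ece += (len(idx) / len(confidences)) * gap
    return ece
```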
3. Results
3.1 Headline numbers
| Judge size | ECE | Brier | Top-band over-confidence |
|---|---|---|---|
| 1.3B | 0.27 | 0.31 | +0.22 |
| 7-13B | 0.18 | 0.24 | +0.18 |
| 34-70B | 0.09 | 0.17 | +0.06 |
| Frontier (600B) | 0.04 | 0.13 | +0.03 |
Smaller judges' over-confidence is concentrated in the 0.8-0.95 band: they emit 0.9 routinely but are correct only 72% of the time.
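One plausible way to compute the table's last column, read here as mean confidence minus empirical accuracy on items the judge scores in the 0.8-0.95 band; the band edges follow the text above and the names are illustrative.

```python
import numpy as np

def band_overconfidence(confidences, correct, lo=0.8, hi=0.95):
    """Mean confidence minus empirical accuracy on items scored within [lo, hi]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    mask = (confidences >= lo) & (confidences <= hi)
    return confidences[mask].mean() - correct[mask].mean() if mask.any() else float("nan")
```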
3.2 The "too-easy" regime
Frontier judges are well-calibrated overall but exhibit a bias on items that are visibly easy: they emit 0.99 when 0.95 would be honest. ECE on the easy decile alone is 0.07, vs. 0.03 globally. We hypothesize this is RLHF-induced agreeability.
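The easy-decile figure can be reproduced by restricting the same ECE computation to the easy subset; `easy_mask` below is a placeholder for whatever easiness proxy is used, and the arrays are assumed to be NumPy arrays.

```python
# easy_mask: boolean array flagging the easiest decile of items (placeholder proxy).
ece_easy = expected_calibration_error(judge_probs[easy_mask], gold_correct[easy_mask])
ece_all = expected_calibration_error(judge_probs, gold_correct)
```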
3.3 Recalibration
We applied isotonic regression on a 30% calibration split and re-measured on the held-out 70%:
| Judge size | Pre-cal ECE | Post-cal ECE |
|---|---|---|
| 1.3B | 0.27 | 0.11 |
| 7-13B | 0.18 | 0.06 |
| 34-70B | 0.09 | 0.03 |
| Frontier | 0.04 | 0.03 |
Isotonic recalibration is most useful for small and mid-sized judges; for frontier judges the marginal gain is within noise.
```python
from sklearn.isotonic import IsotonicRegression

# Fit the isotonic map on the 30% calibration split, then apply it to held-out judge scores.
cal = IsotonicRegression(out_of_bounds="clip").fit(judge_probs, gold_correct)
calibrated = cal.transform(judge_probs)
```

4. Discussion
Practical implication 1. A small judge with isotonic recalibration is competitive with an uncalibrated mid-size judge for downstream selection use cases — and it is much cheaper.
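A minimal sketch of that selection use case, assuming the fitted `cal` from the snippet above and a hypothetical `candidate_scores` array of raw judge confidences for one prompt's candidate outputs:

```python
import numpy as np

calibrated_scores = cal.transform(np.asarray(candidate_scores, dtype=float))
best = int(np.argmax(calibrated_scores))      # best-of-k selection by calibrated probability
accepted = calibrated_scores >= 0.9           # or keep only outputs judged >= 90% likely correct
```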
Practical implication 2. Reporting raw judge confidence as if it were a probability is a real source of overstated benchmark precision. We recommend that benchmark papers explicitly state whether judge confidences have been calibrated.
5. Limitations
Our ground truth is human-adjudicated and itself imperfect; on the argument-quality task, expert disagreement before adjudication was nontrivial. Calibration estimates inherit this noise. We also evaluated only single-pass judges; calibration of multi-pass deliberative judges (e.g., debate setups) is left for future work.
6. Conclusion
LLM-as-judge calibration improves substantially with scale and is largely repaired by isotonic regression at small/mid sizes. Frontier judges are mostly fine on average but exhibit a residual "too-easy" optimism. We release the 38,400-decision dataset to enable replication.
References
- Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
- Guo, C. et al. (2017). On Calibration of Modern Neural Networks.
- Niculescu-Mizil, A. and Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning.
- Min, S. et al. (2024). Re-evaluating Evaluators.