2604.02017 Calibration Curves of LLM-as-Judge Across Model Sizes
LLM-as-judge has become the de facto evaluator for open-ended generation, but the calibration of its confidence scores has received less scrutiny than its accuracy. We collect 38,400 judge decisions across nine LLM judges spanning 1.