Random-Effects Models of Inter-Annotator Disagreement in Preference Data
1. Introduction
Pairwise preference data is now the dominant supervision signal for instruction-tuned language models. The conventional pipeline aggregates labels via majority vote (or, for single-annotator items, accepts the lone label as ground truth) and minimizes a Bradley-Terry loss against the consolidated target. This pipeline silently treats disagreement as i.i.d. Bernoulli noise, but disagreement is structured: some prompts are intrinsically ambiguous, and some annotators are systematically lenient or harsh.
We propose modeling disagreement explicitly with a two-way random-effects model. Each comparison of item $i$ vs. item $j$ judged by annotator $a$ generates a latent utility difference $\Delta_{ija} = \theta_i - \theta_j + \eta_a + \varepsilon_{ij}$, where $\eta_a$ captures annotator-specific bias and $\varepsilon_{ij}$ captures item-pair-specific ambiguity. We show that posterior estimates of the latent rewards $\theta_i$ are better calibrated and that downstream reward models trained on these posteriors generalize better than those trained on majority-vote labels.
2. Background
The Bradley-Terry-Luce (BTL) model assumes $P(i \succ j) = \sigma\!\big((\theta_i - \theta_j)/s\big)$ with a single global noise scale $s$, where $\sigma$ is the logistic function. Generalizations that admit per-judge severity have been studied in psychometrics for decades [Linacre 1994; Patz & Junker 1999], but their adoption in modern preference learning has been sparse, with [Wang et al. 2024] being a notable recent exception.
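For reference, a minimal BTL objective in PyTorch (function and variable names are ours, illustrative only), in which one global scale governs every comparison and neither the annotator nor the item pair enters the noise model:

```python
import torch
import torch.nn.functional as F

def btl_nll(theta, winners, losers, scale=1.0):
    # Plain BTL negative log-likelihood: a single noise scale for all comparisons,
    # with no annotator-specific or item-pair-specific effects.
    logits = (theta[winners] - theta[losers]) / scale
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```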
3. Method
We place priors $\theta_i \sim \mathcal{N}(0, \tau_\theta^2)$, $\eta_a \sim \mathcal{N}(0, \tau_\eta^2)$, and $\varepsilon_{ij} \sim \mathcal{N}(0, \tau_\varepsilon^2)$. The likelihood for an observed preference $y_{ija} \in \{0, 1\}$ is

$$P(y_{ija} = 1 \mid \theta, \eta) = \Phi\!\left(\frac{\theta_i - \theta_j + \eta_a}{\tau_\varepsilon}\right),$$

where $\Phi$ is the standard normal CDF.
We fit via mean-field VI with a Gaussian guide and learn the variance hyperparameters $(\tau_\theta^2, \tau_\eta^2, \tau_\varepsilon^2)$ by Type-II maximum likelihood. The resulting posterior over each item's latent reward $\theta_i$ is then used as a soft label for a downstream reward model.
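Concretely, the soft label for a comparison $(i, j, a)$ can be read off the fitted posterior as the marginal probit probability. A minimal sketch, assuming a PyTorch implementation with the same array names as the ELBO code below; `soft_label` is our illustrative helper, not a function from the paper:

```python
import torch
from torch.distributions import Normal

def soft_label(i, j, a, theta_mu, theta_logvar, eta_mu, eta_logvar, tau_eps_sq):
    # Marginal P(y=1): probit of the posterior-mean utility gap, with posterior
    # uncertainty in theta_i, theta_j, eta_a and the residual noise in the scale.
    diff = theta_mu[i] - theta_mu[j] + eta_mu[a]
    var = theta_logvar[i].exp() + theta_logvar[j].exp() + eta_logvar[a].exp() + tau_eps_sq
    return Normal(0.0, 1.0).cdf(diff / var.sqrt())
```

In the ELBO below, `gaussian_kl` and `bernoulli_probit_ll` are assumed helper functions, and `log_tau_theta_sq`, `log_tau_eta_sq`, and `tau_eps_sq` are the module-level hyperparameters learned by Type-II maximum likelihood.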
```python
def elbo(theta_mu, theta_logvar, eta_mu, eta_logvar, comparisons):
    # KL from each mean-field Gaussian posterior to its zero-mean prior;
    # log_tau_theta_sq and log_tau_eta_sq are the Type-II ML log prior variances.
    kl = gaussian_kl(theta_mu, theta_logvar, 0.0, log_tau_theta_sq) \
       + gaussian_kl(eta_mu, eta_logvar, 0.0, log_tau_eta_sq)
    ll = 0.0
    for i, j, a, y in comparisons:
        # Posterior mean of the latent utility gap theta_i - theta_j + eta_a ...
        diff = theta_mu[i] - theta_mu[j] + eta_mu[a]
        # ... and its marginal variance, including the residual noise tau_eps^2.
        var = theta_logvar[i].exp() + theta_logvar[j].exp() + eta_logvar[a].exp() + tau_eps_sq
        # Probit Bernoulli likelihood with the Gaussian uncertainty folded into the link.
        ll = ll + bernoulli_probit_ll(y, diff, var)
    return ll - kl
```

4. Results
We evaluate on three corpora: HH-RLHF (160K pairs, 142 annotators), UltraFeedback-Pro (480K pairs, 311 annotators), and an internal 1.2M-pair set (727 annotators). We compare against majority-vote BTL and against the soft-label baseline of [Cheng et al. 2025].
| Dataset | Majority-BTL | Soft-Label | Random-Effects (ours) |
|---|---|---|---|
| HH-RLHF | 71.3% | 72.1% | 74.7% |
| UltraFeedback-Pro | 68.9% | 69.6% | 72.6% |
| Internal-1.2M | 73.8% | 74.4% | 76.4% |
Gains in held-out reward-model accuracy range from 2.0 to 3.0 points over the soft-label baseline and from 2.6 to 3.7 points over majority-vote BTL. Calibration improves correspondingly: ECE drops from 0.063 to 0.024 on HH-RLHF.
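The exact ECE estimator is not specified here; for reference, a sketch of the standard equal-width binned version we assume is meant:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    # Binned ECE: group predictions by confidence, then average |accuracy - confidence|
    # across bins, weighted by the fraction of samples falling in each bin.
    probs, labels = np.asarray(probs, float), np.asarray(labels, int)
    conf = np.where(probs >= 0.5, probs, 1.0 - probs)
    pred = (probs >= 0.5).astype(int)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs((pred[mask] == labels[mask]).mean() - conf[mask].mean())
    return ece
```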
The annotator-severity posterior is informative on its own. On the internal corpus, 41 of 727 annotators have a severity posterior $\eta_a$ concentrated away from zero, indicating systematic bias. A further 28 annotators have posterior variance close to the prior variance $\tau_\eta^2$ (i.e., barely shrunken from the prior), suggesting their labels carry little information; flagging these for re-training reduced label cost in a follow-up batch by an estimated 9.1%.
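A hedged sketch of how these two diagnostics might be turned into flags given the fitted posterior arrays; the thresholds `z` and `var_ratio` are illustrative, not the thresholds used above:

```python
import torch

def flag_annotators(eta_mu, eta_logvar, tau_eta_sq, z=2.0, var_ratio=0.9):
    # "Biased": the severity posterior mean sits several posterior SDs from zero.
    posterior_sd = eta_logvar.exp().sqrt()
    biased = eta_mu.abs() / posterior_sd > z
    # "Uninformative": posterior variance barely shrunk from the prior tau_eta^2.
    uninformative = eta_logvar.exp() > var_ratio * tau_eta_sq
    return biased, uninformative
```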
5. Discussion and Limitations
The model assumes additive annotator bias on the latent scale. We do not model interaction effects (e.g., an annotator $a$ who is biased only on prompts of a particular type). Preliminary experiments with a low-rank interaction term yield small additional gains (+0.4 points) at the cost of identifiability headaches; one illustrative parameterization is sketched below.
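One illustrative parameterization of such an interaction (our notation, not necessarily the one used in those preliminary experiments) augments the scalar severity with a rank-$r$ annotator-by-prompt-type term:

$$\eta_{a,(i,j)} \;=\; \eta_a + u_a^\top v_{c(i,j)}, \qquad u_a, v_c \in \mathbb{R}^r,$$

where $c(i,j)$ indexes the prompt type of the pair and $r$ is small; without constraints such as sum-to-zero, mass can shift freely between $\eta_a$ and the interaction term, which is one source of the identifiability headaches noted above.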
Our VI scheme is mean-field; full-covariance variants are tractable up to roughly 200K items but become memory-bound beyond that. For larger corpora, structured VI or HMC on a thinned subset may be preferable.
6. Conclusion
Disagreement in preference data is not noise to be averaged away — it is signal that, when modeled, yields better-calibrated rewards and identifies low-quality annotators. The random-effects formulation is a small change to the standard pipeline and pays for itself in downstream accuracy.
References
- Linacre, J. M. (1994). Many-Facet Rasch Measurement.
- Patz, R. J., & Junker, B. W. (1999). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics.
- Wang, X. et al. (2024). Annotator-aware preference modeling.
- Cheng, Y. et al. (2025). Soft labels from rater agreement for RLHF.
- Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika.