Robust Aggregation of Discordant Annotations via Trimmed Likelihood
1. Introduction
Four annotators rate a response as "helpful"; one rates it as "harmful." The minority annotator may be a careful reader who noticed a subtle problem, a contractor who clicked the wrong button, or a pipeline running a sloppy LLM rater that hallucinated context. Majority vote treats all three cases identically, accepting four-vs-one as a confident win for "helpful." Dawid-Skene-style EM accounts for per-annotator reliability but, because it weights every label, remains vulnerable to even a small fraction of adversarial annotators.
We propose a trimmed likelihood that explicitly removes the worst-fitting fraction of (annotator, item) observations before estimating the consensus labels. The trimming fraction is data-driven, chosen via a goodness-of-fit test on the retained observations.
2. Background
The Dawid-Skene model [Dawid & Skene 1979] posits per-annotator confusion matrices $\theta^{(a)}$ and latent true labels $z_i$, and finds the MLE via EM. The estimator's breakdown point is $1/m$ for $m$ annotators per item: a single bad annotator can move the estimate.
Trimmed likelihood estimators [Neykov & Mueller 2003] generalize the trimmed mean to parametric models: maximize the likelihood over the best-fitting $(1-\varepsilon)$ fraction of observations. Their breakdown point is approximately $\varepsilon$, and they are computable via a re-weighted EM in many cases.
3. Method
Let $y_{ia}$ be annotator $a$'s label on item $i$, and let $z_i \in \{1, \dots, K\}$ be the latent true label. The Dawid-Skene log-likelihood is

$$\ell(\theta, p) = \sum_i \log \sum_{k=1}^{K} p_k \prod_a \theta^{(a)}_{k,\, y_{ia}},$$

where $p_k$ are class priors and $\theta^{(a)}_{k,\ell}$ is the probability that annotator $a$ reports $\ell$ when the truth is $k$. The trimmed log-likelihood keeps only a retained set $S_\varepsilon$ of observations:

$$\ell_\varepsilon(\theta, p) = \sum_i \log \sum_{k=1}^{K} p_k \prod_{a:\,(i,a) \in S_\varepsilon} \theta^{(a)}_{k,\, y_{ia}},$$

where $S_\varepsilon$ is the $(1-\varepsilon)$ fraction of $(i, a)$ observations with highest current log-likelihood. We optimize by alternating: (a) E-step on $z$; (b) M-step on $(\theta, p)$ over the trimmed set; (c) recompute $S_\varepsilon$.
3.1 Choosing $\varepsilon$
We choose $\varepsilon$ from a grid by selecting the smallest value for which the residual deviance of the trimmed model passes a goodness-of-fit test at a fixed significance level. This balances the bias from retained contamination (smaller $\varepsilon$) against goodness of fit on the kept observations (larger $\varepsilon$).
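For concreteness, here is a minimal sketch of that selection loop. It assumes the `trimmed_em` and `obs_loglik` helpers from the listing in Section 4; the grid, significance level, and deviance/degrees-of-freedom computation are illustrative simplifications rather than the paper's exact procedure.

```python
from scipy.stats import chi2

def select_epsilon(y, n_classes, grid=(0.0, 0.01, 0.02, 0.05, 0.10), alpha=0.05):
    """Pick the smallest trimming fraction whose retained fit passes a GOF check.

    Assumes the trimmed_em / obs_loglik helpers from the listing in Section 4;
    the deviance statistic and degrees of freedom below are simplifications.
    """
    for eps in sorted(grid):
        z, theta, mask = trimmed_em(y, n_classes, eps)
        kept_ll = obs_loglik(y, z, theta)[mask]
        # Simplified residual deviance of the retained observations.
        deviance = -2.0 * kept_ll.sum()
        dof = max(int(mask.sum()) - theta.size, 1)
        if deviance <= chi2.ppf(1.0 - alpha, dof):
            return eps           # smallest epsilon whose trimmed model fits
    return max(grid)             # nothing passed; fall back to the heaviest trim
```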
4. Experiments
We evaluate on 9 annotation tasks spanning sentiment, NLI, code-quality, and safety judgments. Each item has 3-7 labels, totaling 41,200 (item, annotator) observations across 1,200 annotators. Expert-adjudicated ground truth is available for 18% of items as a held-out evaluation set.
| Task | Items | Majority (%) | DS-EM (%) | Trimmed (ours, %) | Selected $\varepsilon$ |
|---|---|---|---|---|---|
| sentiment-easy | 800 | 92.4 | 93.7 | 94.0 | 0.01 |
| sentiment-hard | 800 | 71.8 | 76.2 | 79.4 | 0.05 |
| nli-snli-subset | 600 | 81.2 | 84.5 | 87.6 | 0.05 |
| code-quality | 1,200 | 64.1 | 70.3 | 75.0 | 0.10 |
| safety-binary | 2,400 | 86.5 | 88.1 | 90.4 | 0.05 |
| safety-multi | 1,200 | 73.7 | 76.8 | 80.2 | 0.10 |
| factuality | 1,400 | 68.4 | 72.1 | 76.8 | 0.10 |
| coherence | 600 | 79.3 | 81.0 | 81.7 | 0.02 |
| helpfulness | 1,800 | 70.2 | 73.4 | 77.1 | 0.05 |
| Mean | | 76.4 | 79.6 | 82.5 | 0.06 |
Over the 9 tasks, trimmed likelihood agrees with expert truth 82.5% of the time, vs. 79.6% for DS-EM and 76.4% for majority vote.
The selected $\varepsilon$ correlates with task difficulty: easy sentiment tasks need almost no trimming, while code-quality and factuality, where weak annotators are more harmful, select $\varepsilon = 0.10$.
```python
import numpy as np

# init_confusion, majority_init, obs_loglik, e_step, and m_step implement the
# standard Dawid-Skene pieces and are omitted here.
def trimmed_em(y, n_classes, epsilon, max_iter=100):
    theta = init_confusion(y, n_classes)   # per-annotator confusion matrices
    z = majority_init(y, n_classes)        # initialize posteriors from majority vote
    for _ in range(max_iter):
        # log-likelihood of each (i, a) observation under current estimates
        ll = obs_loglik(y, z, theta)
        # trim the worst-fitting epsilon fraction of observations
        thresh = np.quantile(ll, epsilon)
        mask = ll >= thresh
        z = e_step(y, theta, mask)         # posteriors over true labels, trimmed
        theta = m_step(y, z, mask)         # confusion matrices, trimmed
    return z, theta, mask
```

5. Annotator Auditing
The set of trimmed observations is itself a useful signal. An annotator whose labels are trimmed at a rate substantially above the cohort average is a candidate for re-training or removal. On the code-quality task, 14 of 240 annotators had trimming rates substantially above the cohort median; manual review of their work flagged 11 of the 14 as low-quality, a precision of 79%.
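A minimal sketch of this audit, assuming the `mask` returned by `trimmed_em` and a parallel array of annotator indices for each observation; the flagging threshold (a multiple of the cohort median, with a small floor) is an illustrative choice, not the paper's exact rule.

```python
import numpy as np

def trimming_rates(annotator_ids, mask):
    """Fraction of each annotator's observations that were trimmed.

    annotator_ids[j] is the annotator of observation j; mask[j] is True if
    observation j was retained by trimmed_em (these names are assumptions).
    """
    annotator_ids = np.asarray(annotator_ids)
    trimmed = ~np.asarray(mask, dtype=bool)
    return {a: trimmed[annotator_ids == a].mean() for a in np.unique(annotator_ids)}

def flag_annotators(rates, factor=3.0, floor=0.05):
    # Flag annotators trimmed much more often than the cohort median; the
    # factor and floor here are illustrative, not the paper's thresholds.
    median = np.median(list(rates.values()))
    return [a for a, r in rates.items() if r > max(factor * median, floor)]
```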
6. Discussion and Limitations
- The estimator's worst case occurs when adversarial annotators correlate strongly across items (e.g., a single Mechanical Turk fraud farm). Single-pass trimming sees them as "consistent" within their cluster and may not flag them. A clustering pre-pass can address this (a sketch follows this list).
- The goodness-of-fit pre-test for $\varepsilon$ is a heuristic. Cross-validated likelihood is principled but roughly 10x more expensive.
- The method assumes annotator confusion matrices are stable across items in the dataset. Time-varying or topic-varying reliability is left to future work.
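As a rough sketch of the clustering pre-pass mentioned in the first limitation, one could group annotators by pairwise agreement and flag clusters that collectively disagree with a provisional consensus. The hierarchical clustering, the distance cut, and the agreement floor below are illustrative assumptions, not part of the paper's method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def suspicious_clusters(labels, consensus, cut=0.5, min_agreement=0.5):
    """Group annotators by pairwise agreement; flag clusters far from consensus.

    labels[a, i] is annotator a's label on item i (NaN if unlabeled) and
    consensus[i] is a provisional label, e.g. majority vote; all thresholds
    here are illustrative.
    """
    labels = np.asarray(labels, dtype=float)
    consensus = np.asarray(consensus, dtype=float)
    n = labels.shape[0]
    agree = np.eye(n)
    for a in range(n):
        for b in range(a + 1, n):
            both = ~np.isnan(labels[a]) & ~np.isnan(labels[b])
            agree[a, b] = agree[b, a] = (
                (labels[a, both] == labels[b, both]).mean() if both.any() else 0.5
            )
    # Average-linkage clustering on disagreement (1 - agreement) distances.
    tree = linkage(squareform(1.0 - agree, checks=False), method="average")
    assign = fcluster(tree, t=cut, criterion="distance")
    flagged = []
    for c in np.unique(assign):
        members = np.where(assign == c)[0]
        observed = ~np.isnan(labels[members])
        if observed.any() and (labels[members] == consensus)[observed].mean() < min_agreement:
            flagged.append(members.tolist())
    return flagged
```

Clusters flagged this way could then be down-weighted or excluded before running the trimmed EM.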
7. Conclusion
Trimming makes label aggregation robust to a controllable fraction of adversarial or grossly miscalibrated annotators, at small cost on "clean" tasks and substantial benefit on dirty ones. The trimming set offers an additional auditing signal that majority-vote pipelines do not provide.
References
- Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm.
- Neykov, N., & Mueller, C. H. (2003). Breakdown point and computation of trimmed likelihood estimators in generalized linear models.
- Whitehill, J. et al. (2009). Whose vote should count more: optimal integration of labels from labelers of unknown expertise.
- Raykar, V. C. et al. (2010). Learning from crowds.
- Sheshadri, A., & Lease, M. (2013). SQUARE: A benchmark for research on computing crowd consensus.