Meta-Analytic Synthesis of Published Benchmark Scores for Language Models
1. Introduction
When reading a new language-model paper, a familiar exercise is to find the benchmark numbers and compare them against another paper's report for the same baseline model. The numbers rarely match. On MMLU, for example, GPT-4 has been reported with scores ranging from 83.9 to 86.4 in 2024 publications; Llama-3-70B has been reported between 78.1 and 81.7 on the same year's GSM8K. These gaps exceed the entire margin separating consecutive entries on most leaderboards.
We argue that the canonical leaderboard practice — pick a single number, usually the highest reported by the model's authors — is a poor estimator. We instead borrow tools from clinical-trial meta-analysis [DerSimonian & Laird 1986] and treat each report as a noisy estimate of an underlying "true" score under a random-effects model.
2. Data
We scraped 318 preprints from arXiv (cs.CL, cs.LG) and clawRxiv between January 2023 and October 2025. Each (model, benchmark, paper) triple yielded an effect-size estimate $\hat\theta_{ijk}$ along with a sample size $n_{jk}$ (number of evaluation items) and, where reported, a standard error. Where standard errors were missing (78% of cells), we imputed them assuming binomial sampling over the $n_{jk}$ items: $\hat\sigma_{ijk} = \sqrt{\hat\theta_{ijk}(1 - \hat\theta_{ijk})/n_{jk}}$.
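Under the binomial assumption the imputation is a one-liner; a minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def impute_binomial_se(score, n_items):
    """Impute the standard error of an accuracy score, treating each of
    the n_items evaluation examples as an independent Bernoulli trial.

    score   -- reported accuracy in [0, 1]
    n_items -- number of evaluation items in the benchmark
    """
    return np.sqrt(score * (1.0 - score) / n_items)

# Example: MMLU's test split has 14,042 items, so a score of 0.86
# carries a binomial standard error of roughly 0.3 points:
se = impute_binomial_se(0.86, 14042)  # ~0.0029
```

Note that this is a lower bound on the true uncertainty: it captures finite-sample noise only, not prompt or decoding variation, which is exactly why the random-effects term below is needed.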
The resulting corpus contains 2,148 cells covering 47 distinct models on 7 benchmarks (MMLU, GSM8K, HumanEval, ARC-Challenge, HellaSwag, TruthfulQA-MC1, BBH).
3. Model
For each (model $i$, benchmark $j$) we fit a random-effects model:

\hat\theta_{ijk} = \mu_{ij} + u_{ijk} + e_{ijk}, \quad u_{ijk} \sim \mathcal{N}(0, \tau_{ij}^2), \quad e_{ijk} \sim \mathcal{N}(0, \sigma_{ijk}^2)

where $\mu_{ij}$ is the underlying score, $\tau_{ij}^2$ captures cross-paper heterogeneity (prompt template, decoding settings), and $\sigma_{ijk}^2$ is the within-paper sampling variance. We fit by REML.
Heterogeneity is large: the median $I^2$ statistic across (model, benchmark) cells indicates that most of the variance across papers is due to systematic differences, not finite-sample noise.
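The statistic in question is presumably $I^2$ from Higgins & Thompson (2002), cited below; a minimal sketch of how it is computed for one (model, benchmark) cell:

```python
import numpy as np

def i_squared(thetas, sigmas):
    """I^2 heterogeneity statistic [Higgins & Thompson 2002]: the fraction
    of total variance across reports attributable to between-paper
    differences rather than within-paper sampling noise.

    thetas -- per-paper score estimates for one (model, benchmark) cell
    sigmas -- their (possibly imputed) standard errors
    """
    w = 1.0 / sigmas**2                    # fixed-effect weights
    mu_fe = (w * thetas).sum() / w.sum()   # fixed-effect pooled mean
    q = (w * (thetas - mu_fe) ** 2).sum()  # Cochran's Q
    df = len(thetas) - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0
```

An $I^2$ near 1 means the reports disagree far more than their sampling errors allow, which is the regime the paper describes.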
4. Results
4.1 Tighter intervals
The synthesized estimate $\hat\mu_{ij}$ has narrower intervals than any single-paper report. Across 219 (model, benchmark) cells with at least three reports, the median 95% bootstrap CI half-width drops from 4.3 points (single-paper) to 1.7 points (synthesized).
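A percentile bootstrap over reports is one plausible reading of this procedure (the paper does not spell out its resampling scheme); a sketch under that assumption:

```python
import numpy as np

def bootstrap_ci(thetas, sigmas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the inverse-variance-weighted mean of
    one (model, benchmark) cell, resampling papers (reports) with
    replacement. A simplification: weights here ignore tau^2, so this is
    a fixed-effect pooled mean inside each bootstrap replicate.
    """
    rng = np.random.default_rng(seed)
    k = len(thetas)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, k, size=k)  # resample reports with replacement
        w = 1.0 / sigmas[idx] ** 2
        stats.append((w * thetas[idx]).sum() / w.sum())
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

With several reports per cell, the pooled mean varies much less across resamples than any individual report does, which is the mechanism behind the narrower intervals.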
4.2 Leaderboard reordering
When we rank models by $\hat\mu_{ij}$ rather than by max-reported score, the top-10 reorders on three of seven benchmarks. Most strikingly, on GSM8K, two models that differ by 0.6 points in max-report differ by 2.3 points (in the opposite direction) under the synthesized estimate, a swap explained by one of the models having only a single high-temperature report.
4.3 Publication-favoritism analysis
A funnel-plot asymmetry test [Egger et al. 1997] yields significant asymmetry on five of seven benchmarks: papers that introduce a model report higher scores for that model than third-party evaluations do, by a mean of … points (95% CI: …).
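Egger's test regresses standardized effects on precision and examines the intercept; a minimal sketch (our implementation, not necessarily the paper's):

```python
import numpy as np
from scipy import stats

def egger_test(thetas, sigmas):
    """Egger et al. (1997) regression asymmetry test: regress the
    standardized effect z = theta/sigma on precision 1/sigma. Under
    symmetry the intercept is zero; a nonzero intercept indicates that
    imprecise reports drift systematically high or low.

    Returns (intercept, two-sided p-value from a t-test on the intercept).
    """
    z = thetas / sigmas   # standardized effects
    x = 1.0 / sigmas      # precisions
    X = np.column_stack([np.ones_like(x), x])
    beta, residuals, rank, _ = np.linalg.lstsq(X, z, rcond=None)
    rss = residuals[0]
    df = len(z) - 2
    cov = rss / df * np.linalg.inv(X.T @ X)  # OLS covariance of (b0, b1)
    t = beta[0] / np.sqrt(cov[0, 0])
    p = 2 * stats.t.sf(abs(t), df)
    return beta[0], p
```

Applied per benchmark, splitting reports into introducing-paper versus third-party groups would then localize the asymmetry the way the paper describes.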
import numpy as np

def reml_estimate(thetas, sigmas):
    # iterative REML-style fit for the random-effects model: alternately
    # update the pooled mean mu and the between-paper variance tau^2
    # until the fixed point is reached
    thetas = np.asarray(thetas, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    tau2 = 0.0
    for _ in range(50):
        w = 1.0 / (sigmas**2 + tau2)       # inverse-variance weights
        mu = (w * thetas).sum() / w.sum()  # pooled estimate
        new_tau2 = max(0.0, ((w * (thetas - mu)**2).sum() - (len(thetas) - 1)) /
                       (w.sum() - (w**2).sum() / w.sum()))
        if abs(new_tau2 - tau2) < 1e-10:   # converged
            break
        tau2 = new_tau2
    return mu, tau2

5. Discussion
We view this work as a supplement, not a replacement, for any single rigorous evaluation. A well-controlled head-to-head comparison on a fixed harness will always be more informative than a meta-analysis of heterogeneous reports. But for the practical question — "what's a reasonable estimate of model X on benchmark Y given the literature?" — a random-effects synthesis dominates the pick-the-max heuristic.
Limitations
- We rely on author-reported numbers; we cannot detect outright fabrication.
- Our heterogeneity model is exchangeable across papers; in reality, two papers using the same harness are correlated.
- Some benchmarks (notably HumanEval) admit pass@k variants that we collapse into pass@1 where reported, losing some information.
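On the last point, pass@1 and pass@k are connected by the standard unbiased estimator from the Codex paper (Chen et al., 2021); a sketch, for n generated samples of which c pass:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples drawn without replacement from n
    generations (c of them correct) passes. Equals 1 - C(n-c, k)/C(n, k),
    computed in product form for numerical stability.
    """
    if n - c < k:
        return 1.0  # too few failures to fill an all-failing draw of size k
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

Collapsing a pass@10 report to pass@1 via this relation requires knowing n and c, which papers rarely publish, hence the information loss noted above.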
6. Conclusion
Single-paper benchmark scores are noisy. A random-effects meta-analysis over published reports yields tighter intervals, alters leaderboard rankings on multiple benchmarks, and reveals systematic publication favoritism. We recommend that aggregator sites supplement raw scores with synthesized estimates and heterogeneity diagnostics.
References
- DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials.
- Egger, M. et al. (1997). Bias in meta-analysis detected by a simple, graphical test.
- Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis.
- Liang, P. et al. (2023). HELM: Holistic evaluation of language models.
- Biderman, S. et al. (2024). Lessons from the LM Evaluation Harness.