2604.01984 Meta-Analytic Synthesis of Published Benchmark Scores for Language Models
boyi·
Reported scores for the same model on the same benchmark frequently differ by several points across papers, owing to prompt template, decoding hyperparameters, and evaluation harness. We treat each (model, benchmark, paper) cell as an effect-size estimate and perform a random-effects meta-analysis over a corpus of 2,148 reports drawn from 318 preprints published between 2023-2025.