Power Analysis for Pairwise Model Comparisons on Reasoning Benchmarks
Power Analysis for Pairwise Model Comparisons on Reasoning Benchmarks
1. The Underpowered Comparison Problem
A recurring pattern in LLM papers reads roughly: "Our method scores 71.4% versus the baseline's 70.6% ()." Even granting the -value, two questions are usually unanswered: (1) given the benchmark size, what effect could plausibly be detected? and (2) is the claimed difference robust to evaluator stochasticity?
We formalize the first question — statistical power — for the paired binary-correctness setting that dominates reasoning evaluation, and provide closed-form recommendations.
2. Setup
Let and be the correctness indicators on example for models and respectively. Let , , and define the difference . The paired estimator is
Its variance is
where . Equivalently with the per-example correlation :
The positive correlation is typically large in practice — both models tend to get easy problems right and hard problems wrong — and reduces required sample sizes substantially compared to the unpaired case.
3. Sample-Size Formula
For a two-sided test at significance level and power , the required is
With (), , :
| 0.0 | 65{,}920 | 16{,}480 | 4{,}120 |
| 0.4 | 39{,}552 | 9{,}888 | 2{,}472 |
| 0.6 | 26{,}368 | 6{,}592 | 1{,}648 |
| 0.8 | 13{,}184 | 3{,}296 | 824 |
For a typical reasoning benchmark with , only differences of about 2-3 percentage points are reliably detectable at the conventional 80% power threshold.
4. Empirical Estimates of
We estimated for 28 model pairs across MMLU, MATH, GSM8K, ARC, and BBH using publicly available per-example outputs. The distribution of is summarized below:
MMLU mean rho = 0.62 (IQR 0.55-0.71)
MATH mean rho = 0.41 (IQR 0.31-0.55)
GSM8K mean rho = 0.58 (IQR 0.48-0.70)
ARC mean rho = 0.69 (IQR 0.62-0.78)
BBH mean rho = 0.49 (IQR 0.38-0.59)Math subtasks have lower correlation, often because models err on different problems within the same difficulty band. This means power on math benchmarks is worse than on knowledge benchmarks at matched .
5. A Power Calculator
from scipy.stats import norm
def required_n(p_a, p_b, rho, alpha=0.05, power=0.8):
z_a = norm.ppf(1 - alpha / 2)
z_b = norm.ppf(power)
var_a = p_a * (1 - p_a)
var_b = p_b * (1 - p_b)
var_d = var_a + var_b - 2 * rho * (var_a * var_b) ** 0.5
delta = abs(p_a - p_b)
if delta == 0:
return float("inf")
return ((z_a + z_b) ** 2 * var_d) / (delta ** 2)We recommend running this calculator at submission time with the actual benchmark size and reporting the minimum detectable effect (MDE) alongside any pairwise comparison.
6. Re-examination of Recent Claims
We surveyed 12 papers from 2025 that reported small-effect-size claims (sub-2%) on benchmarks with . For each we computed the post-hoc detectable effect at 80% power using the empirical .
- 7 papers' claimed effects were below the MDE.
- 3 papers were borderline (within 50% of MDE).
- 2 papers had effects well above MDE and remain plausible.
This is consistent with concerns raised in [Card et al. 2020] for NLP more broadly.
7. Discussion and Limitations
Power analysis assumes a fixed underlying . Reporting power tables for a range of plausible values is more informative than a single MDE. We also assume binary correctness; for graded scores (e.g., F1) similar formulas apply with the appropriate variance estimate.
The correlation must be estimated from data; using a plug-in estimate slightly understates required , with a finite-sample correction available via the Fisher transformation.
8. Conclusion
Most reasoning benchmarks are too small to detect the differences they are routinely used to claim. Reporting MDE alongside accuracy is a near-zero-cost intervention that would substantially raise the rigor of pairwise comparisons.
References
- Card, D. et al. (2020). With little power comes great responsibility.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences.
- McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions.
- Dror, R. et al. (2018). The hitchhiker's guide to testing statistical significance in NLP.
- Hendrycks, D. et al. (2021). Measuring massive multitask language understanding.
Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.