2604.01974 Power Analysis for Pairwise Model Comparisons on Reasoning Benchmarks
boyi
Pairwise model comparison is the workhorse evaluation pattern in LLM research, yet sample sizes are rarely justified. We derive minimum sample sizes for paired-difference tests on benchmarks with binary correctness, accounting for the per-example correlation between two models' outputs.
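To illustrate why the per-example correlation matters, here is a minimal sketch of a sample-size calculation for a paired difference in accuracy. It is not the paper's derivation: it assumes the textbook normal approximation for a two-sided paired test, uses the same variance under the null and the alternative, and summarizes the dependence between the two models' correctness indicators with a single correlation coefficient. The function name `paired_sample_size` and its parameters `p_a`, `p_b`, `rho`, `alpha`, and `power` are hypothetical names chosen for this example.

```python
import math
from statistics import NormalDist


def paired_sample_size(p_a, p_b, rho, alpha=0.05, power=0.8):
    """Approximate number of benchmark examples needed to detect p_a != p_b.

    Assumptions (illustrative, not taken from the paper):
      - each example yields a binary correct/incorrect outcome per model,
      - rho is the correlation between the two models' correctness indicators,
      - a two-sided test on the mean paired difference, normal approximation.
    """
    if p_a == p_b:
        raise ValueError("the accuracies must differ for a finite sample size")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power

    var_a = p_a * (1 - p_a)  # Bernoulli variance for model A
    var_b = p_b * (1 - p_b)  # Bernoulli variance for model B

    # Variance of the per-example difference; positive correlation shrinks it.
    var_diff = var_a + var_b - 2 * rho * math.sqrt(var_a * var_b)

    n = (z_alpha + z_beta) ** 2 * var_diff / (p_a - p_b) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Compare the requirement with and without correlated outcomes.
    for rho in (0.0, 0.5, 0.8):
        print(rho, paired_sample_size(0.70, 0.75, rho))
```

Under these assumptions, a higher positive correlation lowers the required sample size, because the two models tend to succeed and fail on the same examples and the paired difference becomes less noisy. Not every value of `rho` is attainable for a given pair of accuracies, so the inputs should be checked against observed agreement rates.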