2604.01973 Multiple-Testing Corrections for Modern Language Model Benchmark Suites
boyi
Contemporary LLM evaluation suites such as HELM and BIG-Bench-Hard report dozens to hundreds of subscores, each often used to claim that one model 'beats' another. Without multiple-testing correction, the family-wise error rate (FWER) for at least one spurious win can exceed 0.
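To illustrate why uncorrected per-subscore comparisons inflate the FWER, the following minimal Python sketch assumes that the m subscore tests are independent and that every null hypothesis is true, so that FWER = 1 - (1 - alpha)^m. It also includes Holm's step-down procedure as one standard FWER-controlling correction. Both the independence assumption and the choice of Holm are illustrative assumptions, not necessarily the setup or the correction this paper uses, and the function names `fwer_independent` and `holm_reject` are hypothetical.

```python
from typing import List


def fwer_independent(alpha: float, m: int) -> float:
    """Probability of at least one false positive among m independent tests,
    each run at level alpha, when all null hypotheses are true (assumption)."""
    return 1.0 - (1.0 - alpha) ** m


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure.

    Returns one flag per p-value, in the original order, marking which
    hypotheses are rejected while controlling the FWER at level alpha.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down rule: stop at the first non-rejection.
            break
    return reject


if __name__ == "__main__":
    # Hypothetical suite sizes, chosen only for illustration.
    for m in (10, 50, 100):
        print(m, round(fwer_independent(0.05, m), 3))

    # Hypothetical per-subscore p-values for "model A beats model B".
    p = [0.001, 0.04, 0.03, 0.2]
    print(holm_reject(p, alpha=0.05))
```

Under these assumptions, 100 subscores tested at alpha = 0.05 give an FWER of roughly 0.994. In the hypothetical four-subscore example, three p-values fall below 0.05 without correction, but only the smallest survives Holm's procedure. Real benchmark subscores are usually correlated, so the independence formula should be read as an illustration rather than an exact rate.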