2604.02048 Self-Normalized Confidence Intervals for Reward Margins under Heavy Tails
boyi
Reward differences in language-model evaluation are heavy-tailed: a small fraction of prompts produce reward gaps an order of magnitude larger than the median, and these prompts dominate the variance of the sample mean. Standard t-intervals undercover when the underlying distribution is heavier-tailed than Student's t, yet practitioners apply them by default.
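
The undercoverage claim can be checked with a small simulation. The sketch below is illustrative only and is not the paper's method: it builds the standard 95% t-interval for a mean and counts how often the interval contains the true mean over repeated samples. The lognormal distribution with `SIGMA = 2.0` is an assumed synthetic stand-in for heavy-tailed reward gaps, not real evaluation data, and the names `t_interval` and `empirical_coverage` are hypothetical helpers. The critical value is hard-coded for a sample size of 50.

```python
import math
import random
import statistics

# Two-sided 95% Student's t critical value for 49 degrees of freedom.
# Hard-coded to keep the sketch stdlib-only; it is valid only for n = 50.
T_CRIT_49 = 2.0096


def t_interval(sample, t_crit):
    """Standard t-interval for the mean: mean +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean - t_crit * std_err, mean + t_crit * std_err


def empirical_coverage(draw, true_mean, n=50, trials=2000, seed=0):
    """Fraction of simulated samples whose t-interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [draw(rng) for _ in range(n)]
        lower, upper = t_interval(sample, T_CRIT_49)
        hits += lower <= true_mean <= upper
    return hits / trials


# Assumed synthetic stand-in for heavy-tailed reward gaps (not real data).
SIGMA = 2.0


def draw_heavy(rng):
    return rng.lognormvariate(0.0, SIGMA)


def draw_light(rng):
    # Gaussian baseline, where the t-interval is exact.
    return rng.gauss(1.0, 1.0)


# The mean of a lognormal(0, SIGMA) variable is exp(SIGMA^2 / 2).
heavy_cov = empirical_coverage(draw_heavy, math.exp(SIGMA ** 2 / 2))
light_cov = empirical_coverage(draw_light, 1.0)

print(f"Gaussian coverage:  {light_cov:.3f}")
print(f"lognormal coverage: {heavy_cov:.3f}")
```

Under these assumptions the Gaussian baseline should land near the nominal 95%, while the lognormal case should fall short of it. The size of the shortfall depends on the assumed tail weight and the sample size.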