Self-Normalized Confidence Intervals for Reward Margins under Heavy Tails
Self-Normalized Confidence Intervals for Reward Margins under Heavy Tails
1. Motivation
A paper claims that model B beats model A by in mean reward, with a 95% t-interval of . The reader concludes the difference is significant. But what if the underlying per-prompt reward differences are heavy-tailed? The t-interval's coverage guarantee depends on a normality assumption that real reward distributions routinely violate: in our datasets, 0.5% of prompts produce reward differences exceeding 5 standard deviations of the rest of the data.
We propose using a self-normalized confidence interval, derived from the moderate-deviation theory of Pena, Lai, and Shao. The interval is wider than the t-interval by 10-20% on typical evaluation data, but its coverage is robust to tail behavior.
2. Setup
Let be i.i.d. reward differences (model B reward minus model A reward on prompt ), and let be the population reward margin. The sample mean and variance are
The self-normalized statistic is
The t-interval inverts an assumed Gaussian distribution for . The self-normalized interval inverts a Cramer-type tail bound that holds under finite second moment alone.
3. The Bound
Following [Pena, Lai & Shao 2009], for any ,
where is bounded by a function of the empirical third absolute moment. Inverting this for at the -quantile yields the half-width of the confidence interval:
For and this gives — about 8% wider than the Gaussian quantile.
4. Simulation Study
We generate synthetic reward differences from three families:
- Gaussian (control): .
- Student-t(df=3): scaled to have mean .
- Pareto-shifted: , has tail index .
For each, , and we run 10{,}000 replications.
| Distribution | t-interval coverage | Self-norm coverage |
|---|---|---|
| Gaussian | 94.9% | 94.8% |
| Student-t(3) | 89.2% | 94.3% |
| Pareto, | 84.1% | 94.7% |
The t-interval undercovers on heavy tails by 5-11 percentage points; the self-normalized interval is robust.
5. Real-Data Evaluation
We compute both intervals on three evaluation suites: helpfulness-prefs (n=12,000), code-tasks (n=4,200), and safety-redteam (n=8,400).
| Suite | t half-width | self-norm half-width | inflation | |
|---|---|---|---|---|
| helpfulness-prefs | 0.041 | 0.011 | 0.013 | +18.2% |
| code-tasks | 0.027 | 0.018 | 0.020 | +11.1% |
| safety-redteam | 0.054 | 0.014 | 0.016 | +14.3% |
Width inflation is 11-19%. In all three cases the conclusion (B significantly better than A) survives, but on a fourth suite (style-transfer; not shown) the t-interval excluded zero while the self-normalized interval did not — exactly the regime where the t-interval is least trustworthy.
def self_normalized_ci(x, alpha=0.05):
n = len(x)
mean = x.mean()
s = np.sqrt((x**2).mean() - mean**2)
third = np.abs(x - mean).mean()**3
c_n = third / s**3
# solve x^2 / (2(1 + c_n x/sqrt(n))) = log(2/alpha)
L = np.log(2 / alpha)
# quadratic in x: x^2 - 2 L c_n / sqrt(n) x - 2 L = 0
a, b, c = 1.0, -2 * L * c_n / np.sqrt(n), -2 * L
x_alpha = (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)
half = s / np.sqrt(n) * x_alpha
return mean - half, mean + half6. Discussion and Limitations
The self-normalized bound assumes the data are i.i.d.; it does not handle dependence (e.g., evaluations sharing a prompt template). For mildly dependent data, a block-bootstrap variant of the self-normalized statistic is appropriate but introduces a tuning parameter (block length).
The constant in the moderate-deviation bound is loose. Sharper non-asymptotic constants are available [Bercu & Touati 2008] at the cost of more involved estimation. We chose the simplest form that yields a closed-form interval.
The approach assumes finite second moment. If the per-prompt reward distribution is so heavy-tailed that the variance is undefined, no -rate inference for the mean is possible, and one should report a quantile-based summary instead.
7. Conclusion
Reward-margin claims should be reported with intervals robust to heavy tails. The self-normalized interval is a one-line code change and a 10-20% honest tax on width. We recommend it as the default for evaluation-suite reports.
References
- Pena, V. H., Lai, T. L., & Shao, Q.-M. (2009). Self-Normalized Processes: Limit Theory and Statistical Applications.
- Bercu, B., & Touati, A. (2008). Exponential inequalities for self-normalized martingales.
- Catoni, O. (2012). Challenging the empirical mean and empirical variance.
- Lugosi, G., & Mendelson, S. (2019). Mean estimation and regression under heavy-tailed distributions.
- Romano, J. P., & Wolf, M. (2000). A more general central limit theorem for m-dependent random variables.
Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.