Adaptive Stopping in Sequential A/B Tests for Model Rollouts
1. Introduction
A team launches a candidate model variant alongside the production model in a 50/50 traffic split. A dashboard plots the cumulative win rate. Each morning the on-call engineer checks the chart; if it looks bad, they roll back; if it looks great, they ramp up; if it looks ambiguous, they wait another day. This is a sequential test with optional stopping, and the engineering reality is that the practical Type-I error rate of such procedures is somewhere between 15% and 25% — far above the nominal 5% the dashboard's confidence bands suggest.
This paper applies the always-valid (anytime-valid) confidence sequence framework of [Howard et al. 2021] to the model-rollout problem. The contribution is not new theory but a practical instantiation: closed-form bounds for the three metrics most rollouts actually monitor, plus a retrospective analysis of 17 internal rollouts.
2. Background: Confidence Sequences
A confidence sequence (CS) is a sequence of intervals $(C_t)_{t \ge 1}$ such that $\Pr\left(\forall\, t \ge 1 : \mu \in C_t\right) \ge 1 - \alpha$.
This is strictly stronger than a fixed-horizon CI: it permits the analyst to peek at every time step and stop whenever they like. The price is a wider interval; for sub-Gaussian data with variance proxy $\sigma^2$, the half-width at time $t$ scales as $\sqrt{2\sigma^2\left(\log\log t + \log(1/\alpha)\right)/t}$, an iterated-logarithm factor wider than the fixed-horizon rate of $\sqrt{2\sigma^2\log(2/\alpha)/t}$.
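To make the widths concrete, here is a minimal sketch comparing the anytime-valid half-width above against a standard fixed-horizon sub-Gaussian interval; the constants are schematic, not the tightest known.

```python
import numpy as np

def anytime_half_width(t, sigma2, alpha=0.05):
    """Iterated-logarithm half-width from Section 2 (schematic constants)."""
    t = np.asarray(t, dtype=float)
    return np.sqrt(2 * sigma2 * (np.log(np.log(np.maximum(t, 3.0))) + np.log(1 / alpha)) / t)

def fixed_horizon_half_width(n, sigma2, alpha=0.05):
    """Two-sided fixed-n sub-Gaussian (Hoeffding-style) half-width, for comparison."""
    return np.sqrt(2 * sigma2 * np.log(2 / alpha) / n)

# The anytime band is modestly wider at any single t, but remains valid at every t.
for t in (100, 1_000, 10_000):
    print(t, anytime_half_width(t, 1.0), fixed_horizon_half_width(t, 1.0))
```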
3. Three Practical Estimators
3.1 Mean reward
For scalar reward we use the empirical-Bernstein CS [Waudby-Smith & Ramdas 2024], which adapts to the empirical variance and is non-asymptotic; a minimal implementation appears in the listing at the end of Section 6.
3.2 Win rate
For binary win/loss outcomes we use a beta-binomial mixture-martingale CS in the style of [Robbins 1970], with a conjugate beta mixing distribution. The width scales as $O\!\left(\sqrt{\log(t/\alpha)/t}\right)$.
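A minimal sketch of the win-rate CS, assuming a Jeffreys Beta(1/2, 1/2) mixture (one common conjugate choice; the paper's exact mixing weights may differ). It inverts Robbins' mixture likelihood ratio over a grid of candidate rates:

```python
import numpy as np
from scipy.special import betaln

def beta_binomial_cs(successes, trials, alpha=0.05, a=0.5, b=0.5, grid=2001):
    """CS for a Bernoulli rate: keep every p0 whose mixture likelihood
    ratio has stayed below 1/alpha (Ville's inequality gives coverage)."""
    s, f = successes, trials - successes
    p0 = np.linspace(1e-6, 1 - 1e-6, grid)
    # log [ Beta(a+s, b+f) / Beta(a, b) ]  -  log [ p0^s (1-p0)^f ]
    log_lr = (betaln(a + s, b + f) - betaln(a, b)
              - (s * np.log(p0) + f * np.log(1 - p0)))
    keep = p0[log_lr < np.log(1 / alpha)]
    return keep.min(), keep.max()
```

For the rollout setting, running this per arm at level alpha/2 and differencing the endpoints gives a conservative CS for the difference in win rates.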
3.3 Latency p99
For tail latencies we apply a quantile CS via the symmetric DKW inequality, modified to admit anytime validity. The CDF-space band width scales as $O\!\left(\sqrt{\log(\log(t)/\alpha)/t}\right)$.
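A minimal sketch of the p99 bound, assuming a stitched DKW-style band whose constant is illustrative rather than the tightest known; the rate matches the one quoted above.

```python
import numpy as np

def p99_cs(latencies_so_far, alpha=0.05, q=0.99):
    """Anytime bounds on the q-quantile: band the empirical CDF by eps in
    probability space, then read off the straddling order statistics."""
    x = np.sort(np.asarray(latencies_so_far, dtype=float))
    t = len(x)
    eps = np.sqrt(np.log(np.log(max(t, 3)) / alpha) / (2 * t))  # CDF-space half-band
    lo = x[max(int(np.floor((q - eps) * t)) - 1, 0)]
    hi = x[min(int(np.ceil((q + eps) * t)), t) - 1] if q + eps < 1 else np.inf
    return lo, hi
```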
4. Stopping Rule
We trigger a decision when the CS for the difference-in-metric excludes zero. To balance speed against width, we set $\alpha = 0.05$ and impose an effective-sample-size floor $n_{\min}$ before any stop is allowed, to prevent very-early stopping on noise spikes.
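The rule itself is a few lines; a sketch follows, with n_min=500 as a purely illustrative floor value.

```python
def decide(lo_diff, hi_diff, n_eff, n_min=500):
    """Stopping rule from Section 4. lo_diff/hi_diff: current CS endpoints
    for (candidate - production). n_min=500 is an illustrative floor."""
    if n_eff < n_min:
        return "continue"      # below the effective-sample-size floor
    if lo_diff > 0:
        return "ramp_up"       # CS excludes zero from below: candidate wins
    if hi_diff < 0:
        return "roll_back"     # CS excludes zero from above: candidate loses
    return "continue"          # CS still covers zero: keep collecting
```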
5. Retrospective Evaluation
We replayed 17 internal rollouts under three procedures: (a) a fixed-horizon t-test at the originally chosen end date, (b) a daily-peeking t-test (with the original on-call's stop times), and (c) our always-valid CS.
| Procedure | False winners | False rollbacks | Mean stop time |
|---|---|---|---|
| Fixed-horizon t-test | 1/17 | 0/17 | day 14 (forced) |
| Daily-peek t-test | 4/17 | 1/17 | day 7.4 |
| Always-valid CS | 0/17 | 0/17 | day 8.2 |
"Truth" was determined by a long follow-up evaluation on a held-out audit slice. The always-valid procedure took 11% longer on average than the daily-peek procedure but eliminated the four false-winner cases.
6. A Worked Example
A latency-focused rollout in March 2025 showed an apparent p99 improvement for the candidate model on day 3. The on-call team, applying daily-peek thinking, was within hours of ramping. The always-valid CS at that moment still covered zero, so the rollout was held open. By day 9 the estimated effect had drifted to a p99 regression, and the rollout was correctly aborted.
A minimal NumPy implementation of the empirical-Bernstein CS from Section 3.1:

```python
import numpy as np

def ev_bernstein_cs(rewards, alpha=0.05):
    rewards = np.asarray(rewards, dtype=float)
    t = np.arange(1, len(rewards) + 1)
    mu_hat = np.cumsum(rewards) / t                  # running mean
    # plug-in running variance: deviations from the running mean, not the final mean
    var_hat = np.cumsum((rewards - mu_hat) ** 2) / t
    log_term = np.log(np.log(t + 1) / alpha)         # iterated-log boundary
    half_width = np.sqrt(2 * var_hat * log_term / t) + 7 * log_term / (3 * t)
    return mu_hat - half_width, mu_hat + half_width
```

7. Discussion and Limitations
The always-valid framework assumes i.i.d. (or at least exchangeable) observations within each arm. Real rollouts experience time-of-day and day-of-week variation; we found this added roughly a 6% width inflation in practice but did not break coverage in retrospective replays.
The procedure is conservative for the first peek; if you only ever look once, a fixed-horizon test is tighter. The break-even point for our metrics is roughly two peeks: any rollout with two or more decision moments should use the always-valid bound.
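A back-of-the-envelope version of the break-even claim: compare one anytime band against Bonferroni-corrected fixed-horizon tests for k planned peeks. With the loose schematic constants from the Section 2 sketch the crossover lands at a few peeks rather than two; the tuned empirical-Bernstein bounds used in our replays cross earlier, but the qualitative trend is the same.

```python
import numpy as np

# Schematic widths at a single horizon n; treat the trend across k,
# not the absolute crossover point, as the takeaway.
n, sigma2, alpha = 10_000, 1.0, 0.05
anytime = np.sqrt(2 * sigma2 * (np.log(np.log(n)) + np.log(1 / alpha)) / n)
for k in (1, 2, 3, 5):
    fixed_k = np.sqrt(2 * sigma2 * np.log(2 * k / alpha) / n)
    print(f"k={k} peeks: fixed+Bonferroni={fixed_k:.4f}, anytime={anytime:.4f}")
```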
8. Conclusion
The statistical cost of being able to peek whenever you want is small (around 11% in time-to-decision), and the engineering cost of not being able to peek is large (humans peek anyway). Always-valid CSs resolve the tension, and we recommend they become the default in model-rollout dashboards.
References
- Howard, S. R. et al. (2021). Time-uniform Chernoff bounds via nonnegative supermartingales.
- Waudby-Smith, I., & Ramdas, A. (2024). Estimating means of bounded random variables by betting.
- Robbins, H. (1970). Statistical methods related to the law of the iterated logarithm.
- Johari, R. et al. (2017). Peeking at A/B tests: Why it matters, and what to do about it.
- Lindon, M., & Malek, A. (2022). Anytime-valid inference for multinomial count data.