Multi-Armed Bandits with Drifting Reward Distributions for Model Routing
1. Introduction
A production system runs five language models behind a router. Each query is sent to one model, the user's downstream behavior provides a noisy reward, and the router updates its policy. The reward distribution for each model drifts: a vendor pushes a new model version (a sharp jump), the prompt mix shifts seasonally (slow drift), and tools come online or fail. None of these dynamics fit the textbook stationary multi-armed bandit (MAB), and the textbook fixes, discounted UCB [Garivier & Moulines 2008] and sliding-window UCB [Garivier & Moulines 2011], require a hyperparameter (a discount factor or a window length) that is hard to set without knowing the drift rate.
We propose DriftUCB, which estimates per-arm drift online and adapts its window. The contribution is twofold: (i) a regret analysis under a Lipschitz-drift assumption, and (ii) a practical algorithm with a long-horizon replay evaluation.
2. Setup
At each round $t = 1, \dots, T$, the player picks an arm $A_t \in \{1, \dots, K\}$ and observes a reward drawn from a distribution with mean $\mu_{A_t}(t)$. We assume the means are Lipschitz in time:
$$|\mu_a(t+1) - \mu_a(t)| \;\le\; L \qquad \text{for all arms } a \text{ and rounds } t.$$
This is a different restriction from the "piecewise stationary with breakpoints" assumption of [Garivier & Moulines 2011], which allows abrupt jumps but only at finitely many times, and a stronger one than the worst-case adversarial setting.
The regret against the dynamically optimal arm is
$$R_T \;=\; \sum_{t=1}^{T} \Big( \max_{a} \mu_a(t) - \mu_{A_t}(t) \Big).$$
3. Algorithm
For each arm $a$, DriftUCB maintains a sliding window of length $W_a$ and an estimated drift rate $\hat{L}_a$. The window is set to balance the bias of stale samples against the variance of few samples: equating the drift bias $\hat{L}_a W_a / 2$ with the confidence width $\sqrt{2 \log t / W_a}$ gives
$$W_a(t) \;\propto\; \left( \frac{\log t}{\hat{L}_a^{2}} \right)^{1/3}.$$
The drift rate is estimated by comparing the empirical mean over the most recent half of the window against the older half:
$$\hat{L}_a \;=\; \frac{\left| \bar{\mu}_{a,\text{recent}} - \bar{\mu}_{a,\text{old}} \right|}{W_a / 2}.$$
We stabilize the estimate with a Bayesian (Gamma) prior on $\hat{L}_a$, updated online.
Action selection uses an upper-confidence bound with a drift-aware width:
$$\mathrm{UCB}_t(a) \;=\; \bar{\mu}_{a, W_a}(t) \;+\; \sqrt{\frac{2 \log t}{n_a^{(W)}(t)}} \;+\; \hat{L}_a \cdot \frac{W_a}{2},$$
where $n_a^{(W)}(t)$ is the number of pulls of arm $a$ inside its window and the last term penalizes drift-induced stale-mean bias.
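For concreteness, here is a minimal sketch of the two estimators as standalone helpers operating on a per-arm list of (round, reward) pairs. The `w_max` cap, the small-sample guard, and the clipping are implementation choices assumed for this sketch, and the Gamma-prior smoothing of $\hat{L}_a$ is omitted.

```python
import numpy as np

def estimate_drift(window):
    # Split-window drift estimate: |recent-half mean - old-half mean| / (W / 2).
    if len(window) < 4:                      # guard against tiny windows (assumption)
        return 0.0
    half = len(window) // 2
    old = np.mean([r for _, r in window[:half]])
    recent = np.mean([r for _, r in window[half:]])
    return abs(recent - old) / half

def adaptive_window(history, t, w_max=30_000):
    # Bias-variance balance: W ~ (log t / L_hat^2)^(1/3), clipped to the history.
    l_hat = estimate_drift(history[-w_max:])
    if l_hat == 0.0:                         # no measurable drift: use everything available
        return max(1, min(len(history), w_max))
    w = int((np.log(max(t, 2)) / l_hat ** 2) ** (1.0 / 3.0))
    return max(1, min(w, w_max, len(history)))
```

The DriftUCB listing in Section 5 calls these as methods of the class.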
4. Regret Analysis
Theorem. For any reward sequence with per-round Lipschitz drift bounded by $L$, DriftUCB achieves expected regret
$$\mathbb{E}[R_T] \;\le\; C \, (K L \log T)^{1/3} \, T$$
for an absolute constant $C$.
The proof follows the standard bias-variance decomposition for sliding-window bandits, with the window length set adaptively rather than fixed in advance. The matching lower bound follows by a direct reduction from the lower bound of [Besbes et al. 2014] for budgeted-variation bandits, instantiated with variation budget $V_T = L T$.
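For intuition, here is a back-of-the-envelope version of that decomposition, ignoring constants and the dependence on $K$. A window of length $W$ incurs per-round error at most
$$\underbrace{L\,W}_{\text{stale-mean bias}} \;+\; \underbrace{\sqrt{\frac{\log T}{W}}}_{\text{confidence width}},$$
which is minimized at $W \asymp (\log T / L^{2})^{1/3}$; the per-round error is then on the order of $(L \log T)^{1/3}$, and summing over $T$ rounds recovers the rate in the theorem.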
5. Experiments
5.1 Replay setting
We replayed 4.2M routing decisions over six months (March-September 2025) across five language-model arms. The reward is a derived score combining user edit distance, follow-up rate, and explicit thumbs reactions, normalized to $[0, 1]$.
The ground-truth dynamically optimal arm is approximated by the arm with the highest centered 30-day rolling mean reward.
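As an illustration, here is a sketch of that approximation in pandas; the column names (`ts`, `arm`, `reward`) and the `min_periods` guard are assumptions about the log schema, not the production pipeline.

```python
import pandas as pd

def dynamic_best_arm(logs: pd.DataFrame) -> pd.Series:
    # logs columns assumed for this sketch: "ts" (timestamp), "arm", "reward".
    daily = (logs.set_index("ts")
                 .groupby("arm")["reward"]
                 .resample("1D").mean()      # daily mean reward per arm
                 .unstack(level="arm"))      # rows: days, columns: arms
    # Centered 30-day rolling mean per arm, then the argmax arm for each day.
    rolled = daily.rolling(window=30, center=True, min_periods=15).mean()
    return rolled.idxmax(axis=1)
```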
5.2 Baselines
- UCB1: stationary UCB, no drift handling.
- D-UCB: discounted UCB, with the discount factor chosen via grid search on a first-month holdout (a sketch of the discounted statistics follows this list).
- SW-UCB: sliding-window UCB, with the window length also chosen via grid search.
- DriftUCB: ours, adaptive.
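For reference, here is a minimal sketch of the discounted statistics behind D-UCB (following Garivier & Moulines 2008, constants simplified); the default `gamma` and `xi` values are placeholders, not the grid-searched settings used in the replay.

```python
import numpy as np

class DiscountedUCB:
    # D-UCB: exponentially down-weight old observations with discount factor gamma.
    def __init__(self, K, gamma=0.999, xi=0.6):   # placeholder values, not the tuned ones
        self.gamma, self.xi = gamma, xi
        self.N = np.zeros(K)   # discounted pull counts
        self.S = np.zeros(K)   # discounted reward sums

    def select(self):
        if np.any(self.N == 0):                   # play each arm once first
            return int(np.argmin(self.N))
        n_total = self.N.sum()                    # >= 1 once every arm has been played
        means = self.S / self.N
        bonus = np.sqrt(self.xi * np.log(n_total) / self.N)
        return int(np.argmax(means + bonus))

    def update(self, arm, reward):
        self.N *= self.gamma                      # discount every arm each round
        self.S *= self.gamma
        self.N[arm] += 1.0
        self.S[arm] += reward
```

SW-UCB is analogous but computes the mean and count over only the most recent $\tau$ rounds instead of discounting.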
| Algorithm | Cumulative regret (millions) | Regret relative to DriftUCB |
|---|---|---|
| UCB1 | 0.184 | +63% |
| D-UCB | 0.144 | +28% |
| SW-UCB | 0.164 | +46% |
| DriftUCB (ours) | 0.113 | — |
DriftUCB cuts cumulative regret by 22% relative to D-UCB and by 31% relative to SW-UCB.
5.3 Where the wins come from
Most of DriftUCB's advantage is concentrated around two model-version-update events (April 28 and July 14), where the recent-vs-old comparison detected fast drift within a day and contracted the window from roughly 30,000 to roughly 6,000 samples. D-UCB responded over roughly ten days; SW-UCB never adapted, since its window is fixed.
```python
import numpy as np

class DriftUCB:
    def __init__(self, K, alpha=2.0):
        # Per-arm history of (round, reward) pairs, appended by update().
        self.history = [[] for _ in range(K)]
        self.alpha = alpha

    def select(self, t):
        ucbs = []
        for a, h in enumerate(self.history):
            # adaptive_window() and estimate_drift() implement the window rule
            # and the split-window drift estimator of Section 3.
            W = self.adaptive_window(h, t)
            recent = h[-W:]
            mean = np.mean([r for _, r in recent]) if recent else 0.5
            n = len(recent)
            L_hat = self.estimate_drift(recent)
            # Exploration width plus the drift-induced stale-mean penalty L_hat * W / 2.
            bonus = np.sqrt(self.alpha * np.log(max(t, 1)) / max(n, 1)) + L_hat * W / 2
            ucbs.append(mean + bonus)
        return int(np.argmax(ucbs))

    def update(self, arm, t, reward):
        self.history[arm].append((t, reward))
```
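A minimal end-to-end driver, as a sketch: it assumes the Section 3 helper functions are in scope, attaches them as methods so the listing above runs as written, and feeds simulated drifting rewards as a stand-in for replay data.

```python
import numpy as np

# Assumption: estimate_drift / adaptive_window are the Section 3 sketch functions.
DriftUCB.estimate_drift = staticmethod(estimate_drift)
DriftUCB.adaptive_window = staticmethod(adaptive_window)

rng = np.random.default_rng(0)
router = DriftUCB(K=5)
for t in range(1, 5_001):
    arm = router.select(t)
    # Simulated slowly drifting mean reward in [0, 1]; not replay data.
    mu = 0.5 + 0.1 * np.sin(2 * np.pi * t / 2000 + arm)
    router.update(arm, t, float(np.clip(rng.normal(mu, 0.1), 0.0, 1.0)))
```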
6. Discussion and Limitations
- The Lipschitz drift assumption is violated by abrupt model-version changes. In practice, the adaptive window detects them within a day or two, but the theoretical guarantee covers only smooth drift; an extended analysis covering piecewise-Lipschitz (smooth drift between jumps) is feasible and follows from a union bound.
- Reward observation has variable delay (hours to days for some signals); we treat the delay as additional noise in the reward. Modeling delay explicitly is interesting future work.
- The drift estimator has high variance early in each arm's history, when the window holds few samples. The Gamma prior controls this but introduces a hyperparameter we picked by visual inspection.
7. Conclusion
Reward distributions in language-model routing drift at a moderate rate, and a bandit algorithm that measures the drift rate online beats hand-tuned non-stationary baselines on a long replay. We give matching upper and lower bounds under a Lipschitz drift assumption and a working implementation suitable for production routers.
References
- Garivier, A., & Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems.
- Garivier, A., & Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems.
- Besbes, O. et al. (2014). Stochastic multi-armed-bandit problem with non-stationary rewards.
- Cheung, W. C. et al. (2019). Learning to optimize under non-stationarity.
- Auer, P. et al. (2002). Finite-time analysis of the multiarmed bandit problem.