Online Conformal Calibration for Streaming Generative Models
1. Introduction
Deployed generative systems see traffic that drifts: new product categories, evolving user phrasing, news-driven topical shifts. Calibration computed on a fixed offline split degrades under such drift. We propose an online conformal calibration scheme that tracks long-run coverage in expectation, building on adaptive conformal inference [Gibbs & Candes, 2021].
2. Background
Let $(x_t, y_t)_{t \ge 1}$ be a stream of inputs and ground truths, and $s_t = s(x_t, y_t)$ a non-conformity score (we treat $y_t$ as a delayed-feedback label). The classical split-conformal threshold $\hat{q}$ is replaced with a time-varying threshold $q_t$, updated online.
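For concreteness, the classical split-conformal baseline can be sketched as follows; `split_conformal_threshold` and the synthetic calibration scores are illustrative, not part of our pipeline:

```python
import numpy as np

def split_conformal_threshold(cal_scores, alpha=0.1):
    """Classical split-conformal threshold: the ceil((n + 1) * (1 - alpha)) / n
    empirical quantile of held-out calibration scores."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

# Synthetic calibration split: 1000 Uniform(0, 1) non-conformity scores.
rng = np.random.default_rng(0)
q_hat = split_conformal_threshold(rng.uniform(size=1000), alpha=0.1)
```

At test time, the prediction set keeps every candidate whose score is at most `q_hat`; the online method of Section 3 replaces this fixed threshold with $q_t$.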
3. Method
3.1 Update rule
At each step, after observing whether $s_t \le q_t$, we set
$$q_{t+1} = q_t + \eta_t \,(\mathrm{err}_t - \alpha), \qquad \mathrm{err}_t = \mathbf{1}\{s_t > q_t\},$$
for a learning rate $\eta_t > 0$. Under mild assumptions on score boundedness, the long-run miscoverage satisfies
$$\frac{1}{T} \sum_{t=1}^{T} \mathrm{err}_t \;\longrightarrow\; \alpha \quad \text{as } T \to \infty.$$
We pick $\eta_t$ adaptively via the Robbins-Monro schedule $\eta_t = \eta_0 / \sqrt{t}$.
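As a quick sanity check on the coverage claim, the sketch below runs the update on an i.i.d. Uniform(0, 1) score stream (a deliberate simplification of the drifting setting; the function name and constants are our illustration) and measures empirical miscoverage:

```python
import random

def miscoverage_after(T, alpha=0.1, eta0=0.05, seed=0):
    """Empirical miscoverage of the online update over T steps on an
    i.i.d. Uniform(0, 1) score stream."""
    rng = random.Random(seed)
    q, errors = 0.0, 0.0
    for t in range(1, T + 1):
        err = 1.0 if rng.random() > q else 0.0  # miscoverage indicator
        errors += err
        q += (eta0 / t ** 0.5) * (err - alpha)  # Robbins-Monro step
    return errors / T

rate = miscoverage_after(200_000)  # close to alpha = 0.1
```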
3.2 Handling delayed feedback
Ground-truth labels in our setting arrive after a delay $\Delta$. We accumulate updates in a per-day buffer and apply them in order; the long-run coverage guarantee still holds with an additional $O(\Delta / T)$ penalty.
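A minimal sketch of this buffering, simplified to a fixed per-step delay rather than per-day batches (the generator name and FIFO layout are our illustration, not the deployed implementation):

```python
import random
from collections import deque

def delayed_online_conformal(stream, score_fn, alpha=0.1, eta0=0.05, delay=3):
    """Online threshold update where the label for step t only becomes
    usable `delay` steps later; decisions use the threshold current at
    decision time, and updates are applied in arrival order."""
    q, updates = 0.0, 0
    pending = deque()                         # FIFO of (score, threshold used)
    for x, y in stream:
        pending.append((score_fn(x, y), q))
        if len(pending) > delay:              # feedback for an old step arrives
            s_old, q_old = pending.popleft()
            err = 1.0 if s_old > q_old else 0.0
            updates += 1
            q += (eta0 / updates ** 0.5) * (err - alpha)
        yield q

# Demo on a synthetic Uniform(0, 1) score stream.
_rng = random.Random(1)
qs = list(delayed_online_conformal(
    ((_rng.random(), None) for _ in range(5000)), lambda x, y: x))
```

On this synthetic stream the threshold settles near the 90th-percentile score, as in the undelayed case, after a short transient.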
3.3 Pseudocode
```python
def online_conformal(stream, score_fn, alpha=0.1, eta0=0.05):
    """Online quantile tracking of non-conformity scores (Sec. 3.1)."""
    q = 0.0
    for t, (x, y) in enumerate(stream, start=1):
        s = score_fn(x, y)
        in_set = s <= q                   # covered by the current threshold?
        err = 0.0 if in_set else 1.0      # miscoverage indicator
        eta = eta0 / (t ** 0.5)           # Robbins-Monro step size
        q = q + eta * (err - alpha)       # raise q on misses, lower on hits
        yield q, in_set
```

4. Experiments
4.1 Setup
We simulate a 60-day deployment trace with three drift events: (a) gradual covariate shift over days 5-15, (b) an abrupt distribution swap on day 28, and (c) recurring weekly seasonality. The base model is a 7B-parameter summarization model, and the score is a calibrated reference-free quality predictor.
4.2 Coverage tracking
| Period (days) | Static split | ACI [GC21] | Ours |
|---|---|---|---|
| Pre-drift (1-5) | 9.8% | 9.7% | 9.9% |
| Gradual (5-15) | 13.4% | 10.6% | 10.2% |
| Abrupt (28) | 19.2% | 12.1% | 10.8% |
| Steady-state (29-60) | 14.1% | 10.4% | 9.8% |
Entries are miscoverage rates; the target is $\alpha = 0.1$ (10%).
4.3 Set size
Mean prediction-set size is 2.81 (ours) vs. 2.93 (ACI) vs. 2.42 (static, which undercovers). Our method's tighter sets reflect the per-step adaptation pulling $q_t$ down once coverage stabilizes.
4.4 Robustness to delay
Holding back feedback for a delay $\Delta$ leaves long-run miscoverage close to the target; as $\Delta$ grows, the gap widens.
5. Discussion and Limitations
The procedure is marginal-coverage online: it does not certify per-subgroup coverage. We suggest stratified online conformal as a follow-up.
We assume bounded scores; unbounded log-likelihoods can cause runaway $q_t$. Practitioners should clip scores to a reasonable range or apply a sigmoid transform before the update.
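For instance, a small transform along these lines (a hypothetical helper; the clip range is illustrative) keeps the update well behaved:

```python
import math

def bounded_score(raw, lo=-10.0, hi=10.0, squash=False):
    """Bound a raw non-conformity score before the online update:
    hard clipping by default, or a numerically stable sigmoid to (0, 1)."""
    if squash:
        if raw >= 0:
            return 1.0 / (1.0 + math.exp(-raw))
        z = math.exp(raw)          # avoids overflow for very negative raw
        return z / (1.0 + z)
    return max(lo, min(hi, raw))
```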
6. Conclusion
Online conformal calibration with a Robbins-Monro step size offers a simple, memory-light defense against drift in streaming generative-model deployments. Its long-run coverage tracks the target within 1 percentage point under realistic drift profiles.
References
- Gibbs, I. and Candes, E. (2021). Adaptive Conformal Inference Under Distribution Shift.
- Robbins, H. and Monro, S. (1951). A Stochastic Approximation Method.
- Barber, R. F. et al. (2023). Conformal Prediction Beyond Exchangeability.
- Angelopoulos, A. et al. (2023). Conformal PID Control for Time Series Prediction.