Conformal Prediction Bounds for LLM Output Calibration
1. Introduction
Production LLM systems often expose a confidence score, but those scores are seldom calibrated in any formal sense. Conformal prediction [Vovk et al. 2005] offers a remedy: under exchangeability, it produces prediction sets with finite-sample, distribution-free coverage guarantees. Applying it to LLMs requires care, because outputs are not categorical and the natural correctness criterion is task-specific.
2. Background
Given a non-conformity score $s(x, y)$ and a calibration set $\{(x_i, y_i)\}_{i=1}^{n}$ exchangeable with the test point, split conformal returns the prediction set

$$C(x) = \{\, y : s(x, y) \le \hat{q} \,\},$$

where $\hat{q}$ is the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the calibration scores $s_i = s(x_i, y_i)$. Coverage is guaranteed in finite samples:

$$\mathbb{P}\big(y_{n+1} \in C(x_{n+1})\big) \ge 1 - \alpha.$$
3. Method
3.1 Score for free-form outputs
We sample $k$ candidate completions at temperature $\tau$. For each candidate $y$ we compute a non-conformity score

$$s(x, y) = -\lambda \log p_\theta(y \mid x) - (1 - \lambda)\, g(x, y),$$

where $g$ is a learned correctness predictor (a small classifier trained on labeled $(x, y, \text{correct})$ triples) and $\lambda \in [0, 1]$ trades off model log-probability with the auxiliary signal.
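As a minimal sketch of this trade-off, assuming the score takes the convex-combination form above (the function name and arguments are illustrative; `logprob` stands in for $\log p_\theta(y \mid x)$ and `correctness_prob` for $g(x, y)$):

```python
def nonconformity_score(logprob: float, correctness_prob: float, lam: float = 0.5) -> float:
    """Convex trade-off between model log-probability and the auxiliary signal g(x, y).

    Higher scores mean *less* conforming: low log-probability and low
    predicted correctness both push the score up.
    """
    return -lam * logprob - (1.0 - lam) * correctness_prob
```

At `lam=1.0` the score reduces to negative log-probability; at `lam=0.0` it relies entirely on the auxiliary classifier.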
3.2 Calibration
On a held-out calibration set of $n$ items we set $\hat{q}$ as above. The prediction set collects all candidates whose score falls below the threshold $\hat{q}$.
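Assembling the prediction set is then a single thresholding pass over the sampled candidates (a sketch; the candidate strings here are illustrative):

```python
def prediction_set(candidates, scores, q_hat):
    """Return all candidates whose non-conformity score is at most q_hat."""
    return [c for c, s in zip(candidates, scores) if s <= q_hat]
```

For example, `prediction_set(["a", "b", "c"], [0.1, 0.9, 0.3], q_hat=0.5)` keeps only the first and third candidates.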
3.3 Beyond exchangeability: weighted conformal
For agent trajectories, calibration items are drawn from earlier task distributions and may not be exchangeable with the deployment distribution. Following Tibshirani et al. [2019], we reweight calibration scores by likelihood ratios estimated via a domain classifier.
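A sketch of the weighted quantile computation, assuming likelihood-ratio weights $w(x_i)$ have already been estimated by the domain classifier (the function signature is illustrative):

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile in the style of Tibshirani et al. (2019).

    `weights` are likelihood-ratio estimates w(x_i) for the calibration
    points; `test_weight` is w(x_test) for the deployment point.
    """
    order = np.argsort(scores)
    s = np.asarray(scores, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    # Normalize so calibration weights plus the test point's weight sum to one.
    p = w / (w.sum() + test_weight)
    cdf = np.cumsum(p)
    # Smallest score whose weighted CDF reaches 1 - alpha; if none does,
    # the threshold is +inf and the prediction set keeps every candidate.
    idx = np.searchsorted(cdf, 1.0 - alpha)
    return s[idx] if idx < len(s) else np.inf
```

With uniform weights this reduces to the unweighted split-conformal quantile, which is a useful sanity check.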
4. Results
We evaluate on three benchmarks (MMLU-Pro, GSM8K-augmented, and a proprietary support-ticket triage set).
| Method | Target $\alpha$ | Empirical miscoverage | Mean set size |
|--------------------------|----------------:|----------------------:|--------------:|
| Temperature thresholding | 0.10 | 14.2% | 4.10 |
| Split conformal (ours) | 0.10 | 9.7% | 2.34 |
| Weighted conformal | 0.10 | 10.6% | 2.71 |
Standard errors are computed by a cluster bootstrap over calibration splits (1000 resamples, resampling whole splits).
4.1 Sensitivity to $\lambda$
As $\lambda$ varies, the average set size shrinks monotonically from 2.94 to 2.10 while empirical coverage stays within 1 percentage point of nominal. This suggests the auxiliary correctness predictor genuinely sharpens the score.
```python
import numpy as np

def split_conformal_quantile(scores, alpha):
    """Finite-sample-corrected (n + 1) empirical quantile of calibration scores."""
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q_level, 1.0), method="higher")

calib_scores = np.load("calib_scores.npy")  # shape (n,)
q_hat = split_conformal_quantile(calib_scores, alpha=0.10)
```

5. Discussion and Limitations
The coverage guarantee is marginal — there is no automatic conditional coverage by subgroup. We observed mild under-coverage on the lowest-frequency MMLU-Pro categories (12.8% vs nominal 10%), echoing prior findings on conditional miscoverage [Romano et al. 2020]. A localized conformal variant could address this at the cost of tighter sample-size requirements.
We also rely on the candidate set containing the correct answer with high probability; for our choice of $k$ on MMLU-Pro this holds 96.3% of the time, but tasks with longer answer spaces will need larger $k$.
6. Conclusion
Split conformal prediction yields valid, sample-efficient coverage guarantees for LLM outputs at modest computational cost. We recommend it as the default uncertainty-quantification layer for agent systems where reliability matters.
References
- Vovk, V., Gammerman, A., Shafer, G. (2005). Algorithmic Learning in a Random World.
- Tibshirani, R. et al. (2019). Conformal Prediction Under Covariate Shift.
- Romano, Y. et al. (2020). Classification with Valid and Adaptive Coverage.
- Angelopoulos, A. and Bates, S. (2023). A Gentle Introduction to Conformal Prediction.