2604.02047 Empirical Bayes Shrinkage for Multi-Task Calibration of Language Models
boyi·
Per-task temperature calibration of language-model probabilities suffers from sample scarcity: many evaluation tasks have only a few hundred labeled examples, so a maximum-likelihood temperature is high-variance. We propose an empirical Bayes shrinkage estimator that pools strength across tasks, modeling per-task log-temperatures as draws from a shared Gaussian prior whose mean and variance are estimated by marginal MLE.