A Bayesian Treatment of Self-Consistency Voting in Language Model Reasoning
1. Introduction
Self-consistency [Wang et al. 2022] has emerged as one of the simplest and most effective inference-time techniques for chain-of-thought reasoning: sample rationales independently from the same prompt and return the modal final answer. Despite its widespread use, self-consistency provides only a point estimate. There is no principled way to express the difference in confidence between, say, an 8/10 majority and a 4/10 plurality, even though their reliability characteristics differ markedly.
In this paper we recast self-consistency voting as posterior inference and derive a small set of closed-form formulas that turn vote counts into calibrated probabilities and credible intervals.
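For reference, the baseline procedure we build on is simply a plurality vote over sampled final answers. A minimal sketch is shown below; the `sample_answer` callable is a hypothetical stand-in for drawing one chain-of-thought sample and parsing out its final answer, and all names are illustrative rather than part of a released implementation:

```python
from collections import Counter

def self_consistency(sample_answer, prompt, K=10):
    """Vanilla self-consistency: draw K final answers and return the modal one.

    `sample_answer` is a hypothetical callable that runs one chain-of-thought
    sample for `prompt` and returns the parsed final answer.
    """
    votes = Counter(sample_answer(prompt) for _ in range(K))
    answer, n_top = votes.most_common(1)[0]
    return answer, n_top / K  # modal answer and its raw vote fraction
```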
2. Generative Model
For a fixed prompt, suppose the model induces a distribution $\boldsymbol{p} = (p_1, \dots, p_M)$ over $M$ candidate final answers, where $p_j$ is the probability that an independent sample produces answer $j$. We do not observe $\boldsymbol{p}$; we observe vote counts $\boldsymbol{n} = (n_1, \dots, n_M)$ from $K$ independent samples.
We place a symmetric Dirichlet prior $\boldsymbol{p} \sim \mathrm{Dir}(\alpha_0, \dots, \alpha_0)$ with weakly informative concentration $\alpha_0$. By conjugacy,

$$\boldsymbol{p} \mid \boldsymbol{n} \sim \mathrm{Dir}(\alpha_0 + n_1, \dots, \alpha_0 + n_M).$$
The posterior probability that the plurality answer $m^* = \arg\max_j n_j$ is correct, if we identify correctness with the population mode, is

$$\Pr\bigl(p_{m^*} \ge p_j \ \forall j \mid \boldsymbol{n}\bigr) = \int \mathbf{1}\{p_{m^*} \ge p_j \ \forall j\}\,\mathrm{Dir}(\boldsymbol{p} \mid \alpha_0\mathbf{1} + \boldsymbol{n})\,d\boldsymbol{p}.$$

This integral admits a tractable Monte-Carlo estimate, but for the binary case that usually matters (top-1 versus the rest) the relevant marginal is a Beta:

$$p_{m^*} \mid \boldsymbol{n} \sim \mathrm{Beta}\bigl(\alpha_0 + n_{m^*},\ (M-1)\alpha_0 + K - n_{m^*}\bigr).$$
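A minimal sketch of the Monte-Carlo estimate, sampling from the Dirichlet posterior and counting how often the plurality answer is the population mode (function and variable names are illustrative):

```python
import numpy as np

def posterior_prob_mode(counts, alpha0=0.7, n_draws=20_000, seed=0):
    """P(plurality answer is the population mode | votes) under a Dir(alpha0) prior."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    m_star = counts.argmax()                          # plurality answer
    draws = rng.dirichlet(alpha0 + counts, size=n_draws)
    return (draws.argmax(axis=1) == m_star).mean()    # fraction of draws where it is the mode
```

For counts like (8, 1, 1) this probability is close to 1, while for (4, 3, 3) it drops substantially, which is exactly the distinction the raw vote fraction cannot express.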
3. Calibration
The naive baseline reports the raw vote fraction $n_{m^*}/K$ as its confidence. Our posterior-mean estimate is

$$\hat{p}_{\text{Bayes}} = \frac{\alpha_0 + n_{m^*}}{M\alpha_0 + K},$$
with a credible interval from the Beta quantile function.
We choose $\alpha_0$ by maximum marginal likelihood on a held-out calibration set per benchmark; fitted values cluster between 0.4 and 1.1 across our domains.
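A sketch of this selection step, assuming the marginal likelihood takes the standard Dirichlet-multinomial form evaluated on the held-out calibration problems (names and search bounds here are illustrative choices, not a description of a released implementation):

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

def neg_log_marginal(alpha0, count_vectors):
    """Negative Dirichlet-multinomial log marginal likelihood, summed over problems.

    `count_vectors` is a list of per-problem vote-count arrays; the multinomial
    coefficients are dropped because they do not depend on alpha0.
    """
    total = 0.0
    for n in count_vectors:
        n = np.asarray(n, dtype=float)
        M, K = len(n), n.sum()
        total += (gammaln(M * alpha0) - gammaln(M * alpha0 + K)
                  + np.sum(gammaln(alpha0 + n) - gammaln(alpha0)))
    return -total

def fit_alpha0(count_vectors):
    # Bounded one-dimensional search over alpha0; the bounds are illustrative.
    res = minimize_scalar(neg_log_marginal, bounds=(1e-3, 10.0),
                          args=(count_vectors,), method="bounded")
    return res.x
```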
The posterior-mean confidence and its credible interval then take only a few lines:

```python
from scipy.stats import beta

def bayes_confidence(n_top, K, M, alpha0=0.7):
    """Posterior-mean confidence and 95% credible interval for the plurality answer."""
    a = alpha0 + n_top                    # Beta "success" pseudo-count
    b = (M - 1) * alpha0 + K - n_top      # Beta "failure" pseudo-count
    mean = a / (a + b)                    # equals (alpha0 + n_top) / (M * alpha0 + K)
    lo, hi = beta.ppf([0.025, 0.975], a, b)
    return mean, (lo, hi)
```

4. Experimental Setup
We evaluate on six benchmarks: GSM8K, MATH, MMLU-Pro, ARC-Challenge, BIG-Bench-Hard, and a private code-debugging set. For each benchmark we sample $K$ rationales per problem from a 70B open-weights model.
We report Expected Calibration Error (ECE), Brier score, and the area under the risk-coverage curve (AURC).
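Under the standard definitions, the first two metrics are straightforward to compute from per-problem confidences and correctness indicators. A minimal sketch follows; the 15-bin choice for ECE is an illustrative configuration, not a claim about the exact evaluation setup:

```python
import numpy as np

def ece(conf, correct, n_bins=15):
    """Expected Calibration Error over equal-width confidence bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # bin weight * |mean confidence - empirical accuracy|
            err += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return err

def brier(conf, correct):
    """Brier score for the binary event 'plurality answer is correct'."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))
```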
5. Results
| Benchmark | ECE (vote-frac) | ECE (Bayes) | Brier (vote) | Brier (Bayes) |
|---|---|---|---|---|
| GSM8K | 0.122 | 0.029 | 0.092 | 0.071 |
| MATH | 0.184 | 0.041 | 0.218 | 0.190 |
| MMLU-Pro | 0.097 | 0.022 | 0.143 | 0.131 |
| ARC-Ch. | 0.066 | 0.018 | 0.078 | 0.071 |
| BBH | 0.143 | 0.038 | 0.151 | 0.132 |
| Code-debug | 0.215 | 0.052 | 0.231 | 0.207 |
The Bayesian estimate reduces ECE by a factor of roughly 3-5x on every benchmark. The improvement is most pronounced when the number of samples $K$ is small or when $M$ (the empirical answer cardinality) is large.
5.1 Selective Prediction
Using the Bayesian confidence as a deferral score, we obtain a coverage-accuracy curve that dominates the vote-fraction baseline at every coverage level above 0.5. At a deferral budget of 18% of inputs, accuracy on the accepted inputs rises from 64.0% to 73.6% on MATH.
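The deferral rule itself is simple: rank problems by confidence and answer only the most confident fraction. A sketch of the coverage-accuracy computation (names illustrative; the 18% default mirrors the budget quoted above):

```python
import numpy as np

def coverage_accuracy_curve(conf, correct):
    """Accuracy on the retained set at every coverage level, deferring the
    lowest-confidence examples first."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-conf)                      # most confident first
    sorted_correct = correct[order]
    n = len(conf)
    coverage = np.arange(1, n + 1) / n
    accuracy = np.cumsum(sorted_correct) / np.arange(1, n + 1)
    return coverage, accuracy

def accuracy_at_deferral(conf, correct, deferral_rate=0.18):
    coverage, accuracy = coverage_accuracy_curve(conf, correct)
    keep = coverage <= 1.0 - deferral_rate
    return accuracy[keep][-1] if keep.any() else float("nan")
```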
6. Discussion
The key insight is that vote fractions are not probabilities: they are maximum-likelihood estimates whose sampling noise depends on $K$ and on $M$. Adding a small Dirichlet pseudocount yields a shrinkage estimator that stays well calibrated even at the small $K$ commonly used in production.
A philosophical caveat: we identify correctness with the population mode, which is not always the same as ground-truth correctness. When the model has a systematic bias toward a wrong answer, the Bayesian confidence will be high and wrong. We see this rarely in practice (under 4% of GSM8K errors), but it bounds the calibration any vote-based method can achieve.
7. Limitations
The approach assumes independent samples. Sequential decoding with a shared prefix or KV-cache shortcuts may induce correlation between samples; in such cases $K$ should be replaced by an effective sample size. We have not investigated this empirically.
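One standard candidate for such an adjustment, under an exchangeable pairwise-correlation assumption that we likewise have not validated for correlated decoding, is the design-effect correction from survey sampling:

$$K_{\mathrm{eff}} = \frac{K}{1 + (K - 1)\,\bar{\rho}},$$

where $\bar{\rho}$ is the average pairwise correlation between samples; the counts $n_j$ would then be rescaled by $K_{\mathrm{eff}}/K$ before applying the formulas above.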
We also assume a flat, symmetric prior on $\boldsymbol{p}$. Hierarchical priors that pool information across problems within a benchmark may yield further gains; we leave this to future work.
8. Conclusion
A two-line modification turns self-consistency into a calibrated, uncertainty-aware procedure. We hope this encourages reporting of full confidence distributions rather than point answers, particularly in settings where downstream policy depends on confidence.
References
- Wang, X. et al. (2022). Self-consistency improves chain-of-thought reasoning.
- Guo, C. et al. (2017). On calibration of modern neural networks.
- Kadavath, S. et al. (2022). Language models (mostly) know what they know.
- Geifman, Y. et al. (2017). Selective classification for deep neural networks.
- Gelman, A. et al. (2013). Bayesian Data Analysis.