Self-Verifying Chain-of-Thought via Internal Consistency Checks
1. Introduction
A frequent failure mode for CoT-prompted models is what we call silent contradiction: the trace asserts one claim in step 3 and a contradictory claim in step 7, yet the final answer follows neither cleanly. We propose to make the model explicitly verify its own trace via a short, structured self-check before committing to an answer.
2. Background
Self-consistency [Wang et al. 2022] samples multiple traces and majority-votes the answer. Verifier models [Cobbe et al. 2021] train a separate classifier to score traces. Both work, but at substantial extra cost (≥ 5x inference for self-consistency, training pipeline overhead for verifiers).
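For reference, the majority-vote baseline can be sketched in a few lines; the `sample_trace` callable and the `k=8` default are illustrative conveniences, not an API from either cited paper:

```python
from collections import Counter

def self_consistency(sample_trace, k=8):
    """Majority-vote over k independently sampled CoT traces.

    `sample_trace` is a hypothetical callable that runs one sampled
    trace and returns its final answer. The k-fold inference cost is
    the overhead discussed above.
    """
    answers = [sample_trace() for _ in range(k)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / k  # winning answer plus empirical agreement
```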
We explore a third option: a single trace, with an internal verification phase, executed by the same model.
3. Method
The SV-CoT prompt pattern is:
<reasoning>
step 1, step 2, ..., step n
</reasoning>
<consistency_claims>
- C1: ...
- C2: ...
- C3: ...
</consistency_claims>
<consistency_check>
C1: holds, because ...
C2: contradicted by step 4 — REVISE
C3: holds
</consistency_check>
<answer>...</answer>
If any claim is marked REVISE, the model is instructed to repeat the reasoning phase once more. We cap the loop at two iterations.
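The loop above can be sketched as follows; `generate` is a hypothetical callable that runs the full tagged prompt once and returns the raw model output as a string:

```python
import re

def sv_cot(generate, max_iterations=2):
    """SV-CoT control loop: re-run the reasoning phase at most once
    if any claim in <consistency_check> is marked REVISE.
    """
    output = ""
    for _ in range(max_iterations):
        output = generate()
        check = re.search(r"<consistency_check>(.*?)</consistency_check>",
                          output, re.DOTALL)
        if check and "REVISE" not in check.group(1):
            break  # all claims hold; commit this trace
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return answer.group(1).strip() if answer else None
```

Note that the loop commits the final iteration's answer even if its check still flags REVISE, matching the hard cap of two iterations.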
3.1 Formal interpretation
Let $\tau$ be the reasoning trace and $\mathcal{C} = \{C_1, \dots, C_m\}$ the set of consistency claims. Define the model's verification verdict $v(C_j) \in \{\text{holds}, \text{revise}\}$ for each claim. SV-CoT commits an answer iff

$$v(C_j) = \text{holds} \quad \text{for all } j \in \{1, \dots, m\},$$

with $m = 3$ in our experiments.
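The commit condition reduces to a one-line predicate over the verdicts (a minimal sketch; the verdict strings mirror the template above):

```python
def commits(verdicts):
    """SV-CoT commits an answer iff every claim's verdict is "holds".

    `verdicts` maps claim labels (e.g. "C1") to "holds" or "revise".
    """
    return all(v == "holds" for v in verdicts.values())
```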
4. Datasets
- GSM8K-hard.
- AR-LSAT.
- ARC-Challenge-extended.
- TempLogic-300, a custom suite of 300 temporal-reasoning puzzles, generated and human-validated.
5. Results
| Benchmark | CoT baseline | SV-CoT | Self-consistency (k=8) |
|---|---|---|---|
| GSM8K-hard | 64.2 | 67.0 (+2.8) | 70.1 |
| AR-LSAT | 51.8 | 55.6 (+3.8) | 57.0 |
| ARC-Challenge-ext | 78.4 | 80.6 (+2.2) | 82.1 |
| TempLogic-300 | 42.3 | 51.4 (+9.1) | 49.0 |
SV-CoT closes roughly half to three-quarters of the gap between vanilla CoT and 8-sample self-consistency, at roughly 1.4x the inference cost (rather than 8x). On TempLogic-300 it actually exceeds self-consistency, suggesting that explicit consistency claims help on tasks where multiple plausible traces share the same systematic error.
6. Discussion
Why does SV-CoT help most on TempLogic? Temporal puzzles have many implicit constraints (e.g., transitivity of "before") that majority-voting across samples does not surface. Explicit consistency claims force the model to articulate them.
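To illustrate the kind of implicit constraint involved, a naive transitive-closure check over "before" claims can surface exactly the contradictions that majority voting misses; the pair encoding here is ours, purely for illustration, not the paper's claim format:

```python
from itertools import permutations

def before_violations(claims):
    """Return pairs asserted in both directions after taking the
    transitive closure of "before" claims.

    `claims` is a set of (x, y) tuples meaning "x happens before y".
    """
    closure = set(claims)
    changed = True
    while changed:  # naive fixed-point transitive closure
        changed = False
        for (a, b), (c, d) in permutations(tuple(closure), 2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    # a violation is any ordering asserted in both directions
    return {(x, y) for (x, y) in closure if (y, x) in closure}
```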
Where SV-CoT fails. When the model is systematically wrong in the same direction (e.g., misreads the question), the consistency claims it generates inherit the same misreading, and verification is vacuous. We see this on roughly 12% of GSM8K-hard errors.
7. Limitations
The method is sensitive to the prompt template. A poorly chosen claim format (e.g., "is the answer correct?") collapses to a yes-bias. We provide a recommended template and a small validation harness.
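A minimal sketch of such a validation harness, assuming a hypothetical `verify` callable that maps a claim string to a verdict; the planted-contradiction protocol is illustrative:

```python
def yes_bias_rate(verify, seeded_claims):
    """Measure yes-bias: feed claims with known-planted contradictions
    and count how often the verifier still answers "holds".

    `seeded_claims` pairs each claim string with the verdict it should
    receive ("holds" or "revise"). A rate near 1.0 on the "revise"
    subset signals the collapse described above.
    """
    planted = [c for c, expected in seeded_claims if expected == "revise"]
    if not planted:
        return 0.0
    false_holds = sum(verify(c) == "holds" for c in planted)
    return false_holds / len(planted)
```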
8. Conclusion
A brief, structured self-verification phase recovers most of the benefit of expensive self-consistency at a fraction of the cost, and excels on tasks where contradictions are the dominant error. Future work: combining SV-CoT with reflexion-style outer loops [Shinn et al. 2023], and learning the consistency-claim template via RL.
References
- Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models.
- Cobbe, K. et al. (2021). Training Verifiers to Solve Math Word Problems.
- Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning.
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.