Calibration Collapse in Compound AI Systems: Error Propagation Across Chained Large Language Model Calls
Abstract
Compound AI systems chaining multiple LLM calls are increasingly deployed, but we demonstrate that calibration degrades rapidly across chains. We formalize calibration collapse and derive bounds on chain ECE. Experiments on 2,400 instances show 5-call chain ECE is 4.7x single-call ECE (independent errors) and 8.3x (correlated errors). A calibration-aware routing strategy reduces chain ECE by 43%.
1. Introduction
The shift from monolithic LLM inference to compound AI systems—where multiple LLM calls are orchestrated to solve complex tasks [1]—introduces a new class of reliability challenges. When a single LLM answers a question, its calibration (the alignment between stated confidence and actual accuracy) has been extensively studied [2, 3]. But when the output of one LLM call becomes the input to another, calibration errors compound in ways that have not been formally characterized.
We introduce the concept of calibration collapse: the systematic degradation of system-level calibration as the number of chained LLM calls increases, even when each individual call is well-calibrated.
2. Analytical Framework
2.1 Chain Calibration Model
Consider a chain of LLM calls where the output of feeds into . Let:
- be the stated confidence of the -th call
- be its prediction
- be the Expected Calibration Error of call in isolation
The chain output confidence is typically computed as:
The chain ECE under independent errors is bounded by:
{\text{chain}} \leq \sum{i=1}^{n} \text{ECE}i + \mathcal{O}\left(\sum{i<j} \text{Cov}(\epsilon_i, \epsilon_j)\right)
where is the calibration error of the -th call.
2.2 Confidence Inheritance
In practice, downstream calls do not independently assess the reliability of upstream outputs. Instead, they treat upstream outputs as ground truth, creating a confidence inheritance effect:
This means the actual accuracy of the chain decays as the product of per-call accuracies, while stated confidence may remain high because each call assesses only its own step, not the accumulated error.
3. Experimental Setup
3.1 Compound AI Patterns
We evaluate three common patterns:
| Pattern | Description | Chain Length | Instances |
|---|---|---|---|
| Sequential Pipeline | Output of call → input of call | 2, 3, 5, 7 | 800 |
| Fan-Out Aggregation | One call → parallel calls → aggregation call | 3, 5, 7 | 800 |
| Iterative Refinement | Same call repeated times with self-correction | 2, 3, 5, 7 | 800 |
3.2 Tasks
We construct tasks for each pattern:
- Pipeline: Multi-hop QA (retrieve → reason → format)
- Fan-out: Multi-perspective analysis (decompose → parallel analyze → synthesize)
- Iterative: Code generation (generate → test → debug → regenerate)
3.3 Models
All experiments use GPT-4-Turbo with temperature 0 and explicit confidence elicitation ("Rate your confidence from 0 to 1").
4. Results
4.1 ECE Scaling with Chain Length
| Chain Length | Single-Call ECE | Pipeline ECE | Fan-Out ECE | Iterative ECE |
|---|---|---|---|---|
| 1 | 0.042 | 0.042 | 0.042 | 0.042 |
| 2 | — | 0.098 | 0.071 | 0.083 |
| 3 | — | 0.168 | 0.093 | 0.121 |
| 5 | — | 0.287 | 0.089 | 0.198 |
| 7 | — | 0.391 | 0.104 | 0.264 |
Pipeline ECE grows approximately linearly with chain length (, ). Fan-out ECE saturates at ~0.09-0.10 due to the error-averaging effect of parallel calls. Iterative refinement shows sub-linear growth.
4.2 ECE Amplification Ratios
| Pattern | ECE at | Amplification (vs. single) | Theoretical (independent) |
|---|---|---|---|
| Pipeline | 0.287 | 6.83x | 5.0x |
| Fan-Out | 0.089 | 2.12x | 1.8x |
| Iterative | 0.198 | 4.71x | 3.5x |
Observed amplification exceeds the theoretical independent-error prediction by 37-87%, confirming positive error correlation.
4.3 Confidence Inheritance Analysis
We measure the correlation between upstream confidence and downstream confidence:
| Call Position | Conf. Correlation with Previous | Accuracy Given Upstream Error |
|---|---|---|
| 2 | 0.34 | 0.41 |
| 3 | 0.41 | 0.29 |
| 5 | 0.52 | 0.18 |
Downstream calls become increasingly correlated with upstream confidence (0.34 → 0.52) while accuracy conditional on upstream errors drops (0.41 → 0.18), demonstrating cascading failure.
4.4 Calibration-Aware Routing
We implement a simple intervention: if the confidence of call is below threshold , we route the output to a human-in-the-loop checkpoint rather than passing it to call .
| Threshold | Chain ECE | ECE Reduction | Throughput Loss |
|---|---|---|---|
| 1.0 (no routing) | 0.287 | — | 0% |
| 0.9 | 0.212 | 26% | 3% |
| 0.8 | 0.164 | 43% | 7% |
| 0.7 | 0.118 | 59% | 15% |
| 0.6 | 0.089 | 69% | 28% |
At , we achieve a 43% ECE reduction with only 7% of instances requiring human intervention.
5. Discussion
5.1 System-Level Calibration Is Not Compositional
Our central finding is that well-calibrated components do not compose into well-calibrated systems. This parallels the broader principle that system properties cannot be inferred from component properties [4], and extends it to the specific domain of probabilistic reliability.
5.2 Architectural Implications
The superiority of fan-out architectures for calibration maintenance suggests that compound AI system designers should prefer parallel decomposition over sequential chaining when possible. The error-averaging effect of aggregation naturally dampens calibration drift.
5.3 Limitations
Single model: All calls use GPT-4-Turbo. Heterogeneous chains (mixing models) may show different patterns.
Verbalized confidence: We use model-stated confidence rather than token probabilities. Logit-based calibration may behave differently.
Fixed tasks: Our three patterns cover common architectures but not all possible compositions.
No training intervention: We study post-hoc routing only. Joint training of chain components for calibration is unexplored.
Threshold sensitivity: The optimal routing threshold is task-dependent and requires tuning.
6. Conclusion
We formalized calibration collapse in compound AI systems, showing that chain ECE amplifies 4.7-8.3x over single-call ECE at chain length 5. The dominant mechanism—confidence inheritance—causes downstream calls to amplify upstream errors. Fan-out architectures degrade more gracefully than pipelines, and a simple calibration-aware routing strategy reduces chain ECE by 43%. We urge the compound AI community to adopt end-to-end calibration evaluation.
References
[1] M. Zaharia et al., "The shift to compound AI systems," Berkeley AI Research Blog, 2024.
[2] C. Guo et al., "On calibration of modern neural networks," ICML, 2017.
[3] S. Kadavath et al., "Language models (mostly) know what they know," arXiv:2207.05221, 2022.
[4] N. Leveson, "Engineering a safer world," MIT Press, 2012.
[5] Z. Jiang et al., "How can we know when language models know?," TACL, 2021.
[6] T. Schick et al., "Toolformer," NeurIPS, 2023.
[7] M. Chen et al., "Evaluating large language models trained on code," arXiv:2107.03374, 2021.
[8] Y. Tian et al., "Just ask for calibration," EMNLP, 2023.
Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.