Contagion of Errors: How One Faulty AI Agent Can Crash a Network
Introduction
As AI systems scale from isolated models to interconnected networks—retrieval-augmented generation pipelines, multi-agent debate systems, and ensemble prediction markets—understanding failure propagation becomes critical. A single agent producing incorrect outputs can corrupt its dependents, which corrupt their dependents, producing a cascade analogous to systemic risk in financial networks[acemoglu2015].
We draw on network science to study this problem. Albert, Jeong, and Barabási[albert2000] showed that scale-free networks are robust to random failures but fragile to targeted hub attacks. Watts[watts2002] demonstrated that global cascades depend on a threshold mechanism: local failures go systemic above a critical connectivity level. We extend these insights to AI agent networks with heterogeneous agent types.
Contributions
- A simulation framework studying error propagation through 6 network topologies with 3 agent processing types, totaling 324 controlled experiments.
- Four metrics—cascade size, cascade speed, recovery time, and systemic risk score—that quantify network fragility.
- Evidence that network topology dominates agent type in determining cascade outcomes, with connectivity being the primary risk factor.
- A fully agent-executable skill: all code runs from `SKILL.md` using only the Python standard library plus pytest.
Methods
Network Topologies
We study agents arranged in 6 topologies:
- Chain: linear sequence; each agent depends on one neighbor.
- Ring: chain with endpoints connected.
- Star: one hub connected to all others.
- Erdős–Rényi: edges placed independently at random.
- Scale-free (Barabási–Albert): preferential attachment.
- Fully connected: every agent depends on every other.
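Per the skill file, the paper's generators live in `src/network.py` and return adjacency lists. As a minimal stdlib sketch (function names are illustrative, and the Erdős–Rényi edge probability is left as a parameter since the text does not state it), several of the topologies can be built as follows:

```python
import random

def chain(n):
    """Linear sequence: each agent depends on its immediate neighbor(s)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def ring(n):
    """Chain with the endpoints connected."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def star(n):
    """Node 0 is the hub, connected to all leaves."""
    adj = {0: list(range(1, n))}
    adj.update({i: [0] for i in range(1, n)})
    return adj

def fully_connected(n):
    """Every agent depends on every other agent."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def erdos_renyi(n, p, rng):
    """Each undirected edge is present independently with probability p."""
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj
```

Scale-free (preferential-attachment) generation follows the same pattern but samples attachment targets weighted by current degree.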
Agent Types
Each agent $i$ at round $t$ computes its output from its neighbors' outputs plus noise $\varepsilon_i(t)$:

$$x_i(t+1) = \gamma\, f\big(\bar{x}_i(t)\big) + \varepsilon_i(t),$$

where $\bar{x}_i(t)$ is the mean of neighbor outputs, $\gamma$ is a decay factor, and $c$ is the clipping bound. The fragile agent relays signals linearly ($f(\bar{x}) = \bar{x}$), the averaging agent applies saturation ($f(\bar{x}) = \tanh \bar{x}$), and the robust agent additionally clips extreme inputs ($f(\bar{x}) = \tanh(\mathrm{clip}(\bar{x}, -c, c))$).
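A minimal sketch of the three agent update rules, assuming tanh as the saturating nonlinearity and illustrative values for the decay factor and clipping bound (the paper does not state either constant). The `(List[float], float) -> float` signature matches the one the skill file documents for `src/agents.py`:

```python
import math

GAMMA = 0.9  # decay factor (illustrative value)
CLIP = 2.0   # clipping bound (illustrative value)

def fragile(neighbor_outputs, noise):
    """Linear relay: the decayed neighbor mean passes straight through."""
    m = sum(neighbor_outputs) / len(neighbor_outputs)
    return GAMMA * m + noise

def averaging(neighbor_outputs, noise):
    """Saturating relay: tanh bounds the output regardless of input size."""
    m = sum(neighbor_outputs) / len(neighbor_outputs)
    return GAMMA * math.tanh(m) + noise

def robust(neighbor_outputs, noise):
    """Clip extreme inputs before the saturating nonlinearity."""
    m = sum(neighbor_outputs) / len(neighbor_outputs)
    return GAMMA * math.tanh(max(-CLIP, min(CLIP, m))) + noise
```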
Shock Protocol
At a fixed onset round, a single agent begins outputting a fixed error signal at one of three magnitudes (mild, moderate, severe) for 200 rounds. We test two shock locations: "random" (a non-hub node) and "hub" (the highest-degree node).
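The shock override can be sketched as a small wrapper; the 200-round window is from the protocol above, while the onset round and the numeric magnitude here are illustrative stand-ins (the paper does not state them):

```python
SHOCK_LENGTH = 200  # rounds, per the shock protocol

def shocked_output(normal_output, t, shock_start, magnitude):
    """During the shock window the faulty agent emits a fixed error signal;
    after the window it resumes its normal output (shock removal)."""
    if shock_start <= t < shock_start + SHOCK_LENGTH:
        return magnitude
    return normal_output
```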
Metrics
We run paired simulations (one clean baseline, one shocked) using identical random seeds so that noise sequences match. An agent is infected at round $t$ if its shocked output deviates from its clean-baseline output by more than a fixed threshold.
- Cascade size: fraction of agents ever infected.
- Cascade speed: rounds from shock onset to 50% infection (a sentinel value if 50% is never reached).
- Recovery time: rounds after shock removal until zero agents remain infected (a sentinel value if they never do).
- Systemic risk: a composite score combining cascade size $S$, cascade speed $t_{50}$, and recovery time $t_{\mathrm{rec}}$, normalized by the total number of rounds $T$; higher values indicate greater fragility.
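The first two metrics can be computed from a paired run as sketched below; the infection threshold is a parameter here because the paper does not state its value:

```python
def cascade_metrics(clean_rounds, shocked_rounds, threshold, shock_start):
    """clean_rounds / shocked_rounds: lists over rounds of {agent_id: output}
    from a paired clean/shocked run with matched noise.
    Returns (cascade_size, cascade_speed); speed is None if 50% infection
    is never reached."""
    n = len(clean_rounds[0])
    ever_infected = set()
    speed = None
    for t, (clean, shocked) in enumerate(zip(clean_rounds, shocked_rounds)):
        for agent in clean:
            # Infected: shocked output deviates from the clean baseline.
            if abs(shocked[agent] - clean[agent]) > threshold:
                ever_infected.add(agent)
        if speed is None and len(ever_infected) >= 0.5 * n:
            speed = t - shock_start
    return len(ever_infected) / n, speed
```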
Experiment Design
6 topologies × 3 agent types × 3 shock magnitudes × 2 shock locations × 3 seeds = 324 simulations, each running 5,000 rounds.
All simulations execute in parallel via Python's multiprocessing.Pool.
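The experiment grid and parallel dispatch can be sketched as follows; the condition names come from the design above, while the seed values and the `run_simulation` body are placeholders:

```python
import itertools
from multiprocessing import Pool

TOPOLOGIES = ["chain", "ring", "star", "erdos_renyi", "scale_free",
              "fully_connected"]
AGENT_TYPES = ["robust", "fragile", "averaging"]
SHOCK_MAGNITUDES = ["mild", "moderate", "severe"]
SHOCK_LOCATIONS = ["random", "hub"]
SEEDS = [0, 1, 2]  # three seeds; the actual seed values are assumed

def run_simulation(config):
    # Placeholder for the paired clean/shocked simulation for one config.
    return {"config": config}

if __name__ == "__main__":
    grid = list(itertools.product(
        TOPOLOGIES, AGENT_TYPES, SHOCK_MAGNITUDES, SHOCK_LOCATIONS, SEEDS))
    with Pool() as pool:
        results = pool.map(run_simulation, grid)
    print(f"Completed {len(results)} simulations")  # 6*3*3*2*3 = 324
```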
Results
Topology Risk Ranking
Systemic risk by topology (mean ± std across all conditions).
| Topology | Systemic Risk | Cascade Size |
|---|---|---|
| Fully connected | 1.417 ± 0.091 | 1.000 |
| Scale-free | 1.394 ± 0.127 | 1.000 |
| Star | 1.389 ± 0.136 | 1.000 |
| Erdős–Rényi | 1.287 ± 0.053 | 0.983 |
| Ring | 0.771 ± 0.283 | 0.700 |
| Chain | 0.588 ± 0.278 | 0.550 |
Fully connected networks are the most fragile: every agent is a direct neighbor of the shocked agent, so errors reach all nodes within one round. Chain topologies provide natural firebreaks—errors must propagate sequentially, giving the network time to dampen them.
Agent Type Resilience
Cascade size by agent type (mean ± std).
| Agent Type | Mean Cascade Size |
|---|---|
| Robust | 0.825 ± 0.250 |
| Averaging | 0.826 ± 0.248 |
| Fragile | 0.966 ± 0.104 |
Robust and averaging agents achieve similar resilience (roughly 15% lower cascade size than fragile agents). Both apply a saturating nonlinearity, which bounds large error signals and prevents unbounded error propagation.
Hub vs. Random Attack
In star networks, hub attacks cause 100% cascades while random (leaf) attacks have smaller impact. For scale-free and fully connected networks, both attack types reach full cascade, but hub attacks propagate faster. Chain topologies show the most differentiation: hub attacks affect fewer nodes than random peripheral attacks, because the "hub" of a chain (a central node) has the same degree as its neighbors.
Discussion
Topology dominates agent design. The spread between the most and least risky topologies (fully connected: 1.42 vs. chain: 0.59) is larger than the spread between agent types (fragile: 0.97 vs. robust: 0.83). This suggests that architectural choices about inter-agent connectivity matter more than individual agent hardening.
Connectivity is a double-edged sword. High connectivity enables fast information aggregation but also enables fast error propagation. This mirrors the efficiency-fragility tradeoff observed in financial networks[acemoglu2015].
AI safety implications. Modern AI infrastructure (model chains, agentic pipelines) should incorporate circuit breakers—topological constraints that limit error propagation paths. Low-connectivity relay patterns (chain-like) are more resilient than fully-connected ensemble designs.
Limitations
Our agents use simplified processing functions (tanh saturation, linear relay) rather than actual neural network computations. The fixed-magnitude shock model does not capture gradual degradation. With N = 20 agents, finite-size effects may influence results; larger-scale studies would strengthen the conclusions.
Conclusion
We present an agent-executable simulation studying cascading failures across 324 configurations of multi-agent AI networks. Our key finding is that network topology is the dominant factor in cascade risk: highly connected networks that maximize information flow are also the most vulnerable to error contagion. Robust agent designs (input clipping + nonlinear saturation) provide a 15% reduction in cascade size but cannot compensate for fragile topologies. These results have direct implications for designing resilient AI infrastructure.
References
[albert2000] R. Albert, H. Jeong, and A.-L. Barabási. Error and attack tolerance of complex networks. Nature, 406(6794):378--382, 2000.
[watts2002] D. J. Watts. A simple model of global cascades on random networks. Proceedings of the National Academy of Sciences, 99(9):5766--5771, 2002.
[acemoglu2015] D. Acemoglu, A. Ozdaglar, and A. Tahbaz-Salehi. Systemic risk and stability in financial networks. American Economic Review, 105(2):564--608, 2015.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: cascading-failures-multi-agent-networks
description: Simulate cascading failures in multi-agent AI networks. Studies how one faulty agent's errors propagate through 6 network topologies (chain, ring, star, Erdos-Renyi, scale-free, fully-connected) with 3 agent types (robust, fragile, averaging). Runs 324 simulations with multiprocessing to measure cascade size, speed, recovery time, and systemic risk.
allowed-tools: Bash(python *), Bash(python3 *), Bash(pip *), Bash(.venv/*), Bash(cat *), Read, Write
---
# Cascading Failures in Multi-Agent AI Networks
This skill simulates error propagation through multi-agent networks to study which topologies and agent designs are resilient vs fragile to cascading failures.
## Prerequisites
- Requires **Python 3.10+**. No internet access needed (pure stdlib + pytest).
- Expected runtime: **~90 seconds** for the full 324-simulation experiment.
- All commands must be run from the **submission directory** (`submissions/cascading-failures/`).
## Step 0: Get the Code
Clone the repository and navigate to the submission directory:
```bash
git clone https://github.com/davidydu/Claw4S.git
cd Claw4S/submissions/cascading-failures/
```
All subsequent commands assume you are in this directory.
## Step 1: Environment Setup
Create a virtual environment and install dependencies:
```bash
python3 -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
```
Verify installation:
```bash
.venv/bin/python -c "import pytest; print('All imports OK')"
```
Expected output: `All imports OK`
## Step 2: Run Unit Tests
Verify all modules work correctly (31 tests):
```bash
.venv/bin/python -m pytest tests/ -v
```
Expected: `31 passed` and exit code 0.
## Step 3: Run Diagnostic
Quick validation with 18 simulations (1 topology, 1 agent type):
```bash
.venv/bin/python run.py --diagnostic
```
Expected: Prints report and exits with code 0. Creates `results/results.json` and `results/report.md`.
## Step 4: Run Full Experiment
Execute all 324 simulations (6 topologies x 3 agent types x 3 shock magnitudes x 2 shock locations x 3 seeds):
```bash
.venv/bin/python run.py
```
Expected: Prints `Completed 324 simulations` and full report. Creates `results/results.json` and `results/report.md`.
This will:
1. Generate networks for all 6 topologies (N=20 agents each)
2. Run paired simulations (clean baseline + shocked) for each configuration
3. Track error propagation: cascade size, speed, recovery time, systemic risk
4. Aggregate metrics across seeds with mean and standard deviation
5. Save raw and aggregated results to `results/results.json`
6. Generate summary report at `results/report.md`
## Step 5: Validate Results
Check completeness and scientific sanity:
```bash
.venv/bin/python validate.py
```
Expected: Prints simulation counts, agent comparisons, and `Validation passed.`
## Step 6: Review the Report
```bash
cat results/report.md
```
Expected: Markdown report with topology risk ranking, hub vs random attack comparison, agent type resilience ranking, and key findings.
## How to Extend
- **Add topologies:** Implement a new generator in `src/network.py` returning `AdjList`, add to `TOPOLOGIES` dict.
- **Add agent types:** Implement a new function in `src/agents.py` with signature `(List[float], float) -> float`, add to `AGENT_TYPES` dict.
- **Change parameters:** Edit `src/experiment.py` constants: `N_AGENTS`, `TOTAL_ROUNDS`, `SHOCK_MAGNITUDES`, `SEEDS`.
- **Add metrics:** Extend `src/metrics.py` with new aggregation functions.
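As a hypothetical illustration of the first extension point, a 2D grid topology could be added to `src/network.py`. The `AdjList` alias and `TOPOLOGIES` registry are the names documented above; the module's exact contents are assumed:

```python
import math
from typing import Dict, List

AdjList = Dict[int, List[int]]

def grid_2d(n: int) -> AdjList:
    """4-neighbor lattice on an s x s grid, where s = isqrt(n)."""
    s = math.isqrt(n)
    adj: AdjList = {}
    for i in range(s * s):
        r, c = divmod(i, s)
        neighbors = []
        if r > 0:
            neighbors.append(i - s)  # up
        if r < s - 1:
            neighbors.append(i + s)  # down
        if c > 0:
            neighbors.append(i - 1)  # left
        if c < s - 1:
            neighbors.append(i + 1)  # right
        adj[i] = neighbors
    return adj

# Register alongside the built-in generators:
# TOPOLOGIES["grid_2d"] = grid_2d
```

A grid sits between the chain and Erdős–Rényi topologies in connectivity, so it would be a natural probe of the connectivity-risk relationship reported above.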