Recursive Reasoning in Multi-Agent Systems: Strategic Depth as a Distributional Safety Risk
Introduction
The intersection of recursive reasoning and multi-agent safety is understudied. Most alignment research treats agents as either aligned or misaligned in a binary sense. The SWARM framework instead uses soft probabilistic labels — each interaction receives a probability p of being beneficial — enabling a richer distributional analysis of safety outcomes.
Recursive Language Models (RLMs) represent a class of agents that apply iterated reasoning over their action space: at recursion depth k, an RLM models its counterparties as depth-(k-1) reasoners and best-responds accordingly. This mirrors the level-k thinking framework from behavioral game theory (Stahl & Wilson, 1994; Nagel, 1995). The key safety question is: does increased reasoning depth create emergent coordination risks that governance mechanisms cannot keep pace with?
We operationalize this question through three experiments:
- Recursive Collusion (Exp. 1): Does deeper recursion enable implicit coordination without explicit communication?
- Memory-as-Power (Exp. 2): Does asymmetric working memory create exploitative power imbalances?
- Governance Lag (Exp. 3): Can fast-adapting agents outpace deliberately slow governance responses?
Each experiment uses 10 pre-registered seeds with Bonferroni and Holm-Bonferroni correction for multiple comparisons.
Experimental Setup
RLM Agent Architecture
RLM agents implement level-k thinking algorithmically (no LLM API calls). The key parameters are:
| Parameter | Description | Values Tested |
|---|---|---|
| recursion_depth | Levels of iterated best response | 1, 3, 5 |
| planning_horizon | Steps of discounted look-ahead | 3, 5, 7, 8 |
| memory_budget | Max entries in structured working memory | 10, 50, 100, 150, 200 |
At each decision point, a depth-k agent: (1) generates candidate actions, (2) evaluates each via _evaluate_action_recursive(), (3) models counterparties at depth-(k-1) using CounterpartyModel, and (4) selects actions via epsilon-greedy exploration at rate 0.1/(depth+1). Level-0 (base case) selects the highest-trust visible agent and proposes collaboration, equivalent to a naive honest agent.
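The decision loop above can be sketched as follows. This is an illustrative reconstruction, not the SWARM source: the function names echo _evaluate_action_recursive() and the level-0 base case described above, and the payoff function, action set, and simplified base case (assume the counterparty collaborates) are assumptions.

```python
# Sketch of level-k action selection with epsilon-greedy exploration at
# rate 0.1/(depth+1). Hypothetical interface, not the SWARM implementation.
import random

def evaluate_action_recursive(action, depth, counterparty_actions, payoff):
    """Value of `action` against a depth-(k-1) model of the counterparty."""
    if depth == 0:
        # Level-0 base case: treat the counterparty as a naive honest agent
        # (the doc's level-0 proposes collaboration; simplified here).
        return payoff(action, "collaborate")
    # Model the counterparty as a depth-(k-1) best-responder (symmetric game).
    best_reply = max(
        counterparty_actions,
        key=lambda b: evaluate_action_recursive(
            b, depth - 1, counterparty_actions, payoff),
    )
    return payoff(action, best_reply)

def select_action(actions, depth, counterparty_actions, payoff, rng=random):
    eps = 0.1 / (depth + 1)          # exploration rate decays with depth
    if rng.random() < eps:
        return rng.choice(actions)   # explore
    return max(actions, key=lambda a: evaluate_action_recursive(
        a, depth, counterparty_actions, payoff))
```

For a symmetric two-action game (e.g. prisoner's-dilemma payoffs), every depth k ≥ 0 best-responds to the modeled reply one level down, so deeper agents burn recursion without gaining information once counterparties are equally sophisticated — the dynamic Exp. 1 probes.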
Scenarios
| Scenario | Agents | RLM Config | Epochs | Steps | Network | Governance |
|---|---|---|---|---|---|---|
| Recursive Collusion | 9 RLM + 3 honest | depth={1,3,5} × 3 | 30 | 15 | Complete | Collusion detection, audit (p=0.15) |
| Memory-as-Power | 7 RLM + 3 honest | mem={200,50,10}, depth=3 | 40 | 15 | Complete | Audit (p=0.1), circuit breaker, ρ=0.1 |
| Governance Lag | 4 RLM + 4 honest + 2 adversary | depth=5, mem=150 | 50 | 15 | Small-world (k=4, p=0.2) | Slow: audit (p=0.05), high thresholds |
Payoff Model
All experiments use the SWARM soft payoff engine: an expected-surplus term per interaction, an externality cost scaled by the internalization rate ρ, and an acceptance threshold on the probability p that an interaction is beneficial. (The specific parameter values did not survive extraction.)
Statistical Methods
- 10 seeds per experiment, fixed a priori: {42, 7, 123, 256, 999, 2024, 314, 577, 1337, 8080}
- Two-sample independent t-test, one-way ANOVA, Pearson correlation
- Cohen's d for effect sizes, Gini coefficient for inequality
- Bonferroni and Holm-Bonferroni corrections per experiment
- Total tests: 10 (Exp. 1) + 11 (Exp. 2) + 5 (Exp. 3) = 26
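The reported statistics can be reproduced with standard formulas; a minimal NumPy sketch of the three non-trivial ones (pooled-SD Cohen's d, the Gini coefficient, and Holm–Bonferroni step-down), illustrative rather than the SWARM analysis code:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

def gini(x):
    """Gini coefficient of a non-negative payoff vector."""
    x = np.sort(np.asarray(x, float))
    n = len(x)
    return (2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum())

def holm_bonferroni(pvals, alpha=0.05):
    """Boolean mask: which hypotheses are rejected under Holm's step-down."""
    pvals = np.asarray(pvals, float)
    order = np.argsort(pvals)
    m = len(pvals)
    reject = np.zeros(m, bool)
    for rank, idx in enumerate(order):
        if pvals[idx] > alpha / (m - rank):
            break                      # all larger p-values also fail
        reject[idx] = True
    return reject
```

Holm's step-down is uniformly more powerful than plain Bonferroni, which is why some tests below (e.g. Exp. 1's Mid vs Deep) fail Bonferroni but survive Holm.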
Results
Cross-Experiment Summary
| Experiment | Primary Finding | Key Statistic | Gini | Honest Advantage |
|---|---|---|---|---|
| Recursive Collusion | Deeper recursion hurts payoff | r = -0.746, p < 0.001 | 0.299 | 2.8× |
| Memory-as-Power | Memory advantage real but modest | r = +0.673, p < 0.001 | 0.236 | 2.3× |
| Governance Lag | RLM outperforms via strategy, not evasion | d = 2.14, p = 0.0001 | 0.325 | 0.72× (reversed) |
Experiment 1: Recursive Collusion
Hypothesis: Deeper recursive reasoning enables implicit coordination, producing higher payoffs for deep thinkers.
Result: Rejected. Deeper recursion monotonically decreases payoff.
| Group | Recursion Depth | Mean Payoff | Std |
|---|---|---|---|
| RLM Shallow | 1 | 219.661 | 4.367 |
| RLM Mid | 3 | 213.643 | 2.010 |
| RLM Deep | 5 | 211.351 | 2.219 |
| Honest | — | 592.980 | 9.886 |
| Test | Statistic | p-value | Cohen's d | Bonferroni | Holm |
|---|---|---|---|---|---|
| Shallow vs Mid | t = 3.959 | 0.0009 | 1.770 | Survives | Survives |
| Shallow vs Deep | t = 5.365 | < 0.0001 | 2.399 | Survives | Survives |
| Mid vs Deep | t = 2.421 | 0.026 | 1.082 | Fails | Survives |
| ANOVA (RLM tiers) | F = 19.712 | < 0.0001 | — | Survives | Survives |
| Pearson (depth, payoff) | r = -0.746 | < 0.0001 | — | Survives | Survives |
| All RLM vs Honest | t > 109 | < 0.0001 | d > 48 | Survives | Survives |
The "strategic overthinking" effect: more computation cycles spent modeling counterparties yields worse outcomes. RLM payoffs are remarkably stable (std = 2–4), while honest agent payoffs are highly variable (std ≈ 10), suggesting RLM reasoning produces consistent but mediocre outcomes.
Experiment 2: Memory-as-Power
Hypothesis: Agents with larger working memory budgets systematically exploit agents with smaller budgets.
Result: Supported, with caveats. The gradient is monotonic and statistically robust, but practically small.
| Group | Memory Budget | Mean Payoff | Std |
|---|---|---|---|
| RLM High | 200 | 253.376 | 3.613 |
| RLM Mid | 50 | 249.063 | 2.579 |
| RLM Low | 10 | 245.442 | 4.140 |
| Honest | — | 563.106 | 6.892 |
| Test | Statistic | p-value | Cohen's d | Bonferroni | Holm |
|---|---|---|---|---|---|
| High vs Low | t = 4.566 | 0.0002 | 2.042 | Survives | Survives |
| High vs Mid | t = 3.072 | 0.007 | 1.374 | Fails | Survives |
| Mid vs Low | t = 2.348 | 0.031 | 1.050 | Fails | Survives |
| ANOVA (RLM tiers) | F = 12.845 | 0.0001 | — | Survives | Survives |
| Pearson (budget, payoff) | r = +0.673 | < 0.0001 | — | Survives | Survives |
The high-low spread is only 7.93 points (3.2%). The dominant effect remains the honest-vs-RLM gap (2.3×). Externality internalization (ρ = 0.1) produces the lowest Gini (0.236) across all three experiments.
Experiment 3: Governance Lag
Hypothesis: Fast-adapting RLM agents can exploit the gap between harmful action and governance response.
Result: Partially supported. RLM agents outperform, but through legitimate strategy rather than governance evasion.
| Group | Mean Payoff | Std |
|---|---|---|
| RLM Fast (depth=5) | 335.606 | 19.600 |
| Honest | 242.755 | 58.128 |
| Adaptive Adversary | 221.490 | 174.549 |
| Test | Statistic | p-value | Cohen's d | Bonferroni | Holm |
|---|---|---|---|---|---|
| RLM vs Honest | t = 4.787 | 0.0001 | 2.141 | Survives | Survives |
| RLM vs Adversary | t = 2.054 | 0.055 | 0.919 | Fails | Fails |
| Gini > 0 | t = 12.891 | < 0.0001 | — | Survives | Survives |
This is the only experiment where RLM agents outperform honest agents, reversing the pattern from Exp. 1–2. The mechanism is not governance evasion — RLM signal profiles produce moderate p-values (~0.5–0.7), well above the harm threshold. Instead, depth-5 agents exploit the small-world network topology for better partner selection.
Cross-Experiment Patterns
The honest advantage is context-dependent. In complete networks (Exp. 1–2), honest agents earn 2.3–2.8×. In small-world networks with slow governance (Exp. 3), RLM agents earn 1.4×. Network topology mediates the value of strategic reasoning.
Strategic reasoning produces stability at the cost of magnitude. RLM group-mean standard deviations range from 2.0 to 19.6, while honest group-mean standard deviations range from 6.9 to 58.1.
Gini coefficients reveal governance quality. Exp. 2 (lowest Gini = 0.236) uses externality internalization (ρ = 0.1); Exp. 3 (highest Gini = 0.325) uses deliberately weak governance.
Robustness. Of 26 total tests, 24 survive Holm-Bonferroni and 20 survive strict Bonferroni.
Discussion
Strategic Overthinking
The most surprising finding is that deeper recursion hurts within-group payoff (Exp. 1, r = -0.75). We hypothesize three mechanisms: (1) computational cost without information gain when all agents are similarly sophisticated, (2) overcaution from aggressive discounting that reduces interaction volume, and (3) epsilon-greedy decay (exploration rate 0.1/(depth+1)) causing deeper agents to miss beneficial interactions.
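Mechanism (3) is easy to quantify from the stated schedule. A back-of-envelope check (not simulation output), using the 15-step epoch length from the scenario table:

```python
# Exploration implied by the 0.1/(depth+1) schedule over a 15-step epoch.
steps = 15
for depth in (1, 3, 5):
    eps = 0.1 / (depth + 1)
    print(f"depth={depth}: eps={eps:.4f}, "
          f"expected exploratory steps per epoch={steps * eps:.2f}")
```

A depth-5 agent explores at rate ~0.017, i.e. roughly one exploratory action every four epochs, versus one every ~1.3 epochs at depth 1 — consistent with deeper agents missing beneficial interactions.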
Memory as a Modest Power Lever
The memory-as-power effect (r = +0.67) is real but small (3.2% spread). In SWARM-style ecosystems, information asymmetry alone does not confer large advantages. The dominant factor is the type of strategy (honest vs. strategic), not the resources available to a given strategy.
Network Topology as a Moderator
The reversal of the honest advantage in Exp. 3 (small-world) vs. Exp. 1–2 (complete network) suggests that network topology is a critical moderator of distributional safety. In complete networks, honest agents benefit from high connectivity; in small-world networks, strategic agents exploit structural holes. Governance mechanisms effective in well-connected networks may not transfer to sparse topologies.
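The two topologies differ sharply in connectivity. A quick comparison using NetworkX (assumed here as the generator; the n=10 population matches Exp. 3's 4 RLM + 4 honest + 2 adversary, with k=4, p=0.2 from the scenario table):

```python
import networkx as nx

n = 10                                           # Exp. 3 population size
complete = nx.complete_graph(n)                  # Exp. 1-2 topology
small_world = nx.watts_strogatz_graph(n=n, k=4, p=0.2, seed=42)  # Exp. 3

print("complete:    edges =", complete.number_of_edges())   # n*(n-1)/2 = 45
print("small-world: edges =", small_world.number_of_edges())  # n*k/2 = 20
print("small-world avg clustering =",
      round(nx.average_clustering(small_world), 3))
```

With fewer than half the edges of the complete graph, the small-world network leaves structural holes: pairs of honest agents that can only reach each other through intermediaries, which depth-5 agents can occupy for partner selection.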
Governance Implications
| Lever | Observed Effect |
|---|---|
| Externality internalization (ρ > 0) | Reduces Gini by ~0.06–0.09 |
| Higher audit probability | No significant effect on RLM behavior |
| Small-world topology | Enables strategic agents, disadvantages honest |
| Collusion detection | No implicit collusion detected in any experiment |
Limitations
- RLM agents use algorithmic level-k reasoning, not LLM inference. Results may not transfer to LLM-based agents.
- Fixed payoff parameters may not generalize. (The specific values did not survive extraction.)
- Multiple comparisons correction is per-experiment, not study-wide (26 tests).
- 10 seeds per experiment provides adequate power for large effects (d > 1) but may miss smaller effects.
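The power claim in the last limitation can be checked under standard assumptions (independent two-sample t-test, two-sided α = 0.05, 10 observations per group); a sketch using statsmodels, not part of the original analysis pipeline:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.5, 1.0, 1.5, 2.0):
    power = analysis.power(effect_size=d, nobs1=10, alpha=0.05, ratio=1.0)
    print(f"d = {d:.1f}: power ~ {power:.2f}")
```

Power rises steeply with effect size at n = 10, so the very large effects reported here (d > 2 in several tests) are well detected, while effects below d ≈ 1 would frequently be missed.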
Reproducibility
Source code: https://github.com/swarm-ai-safety/swarm
```bash
# Clone the repo
git clone https://github.com/swarm-ai-safety/swarm.git
cd swarm && pip install -e '.[dev,runtime]'

# Experiment 1: Recursive Collusion
python -m swarm run scenarios/rlm_recursive_collusion.yaml --seed 42 7 123 256 999 2024 314 577 1337 8080

# Experiment 2: Memory-as-Power
python -m swarm run scenarios/rlm_memory_as_power.yaml --seed 42 7 123 256 999 2024 314 577 1337 8080

# Experiment 3: Governance Lag
python -m swarm run scenarios/rlm_governance_lag.yaml --seed 42 7 123 256 999 2024 314 577 1337 8080
```

References
- Stahl, D. O., & Wilson, P. W. (1994). Experimental evidence on players' models of other players. Journal of Economic Behavior & Organization, 25(3), 309–327.
- Nagel, R. (1995). Unraveling in guessing games: An experimental study. American Economic Review, 85(5), 1313–1326.
- Crawford, V. P., Costa-Gomes, M. A., & Iriberri, N. (2013). Structural models of nonequilibrium strategic thinking. Journal of Economic Literature, 51(1), 5–62.