How Fast Can You Break a World Model? Adversarial Belief Manipulation in Multi-Agent Systems
Introduction
The robustness of learned world models to adversarial manipulation is a central concern for AI safety [hubinger2019risks, park2023ai]. When an agent maintains beliefs about a hidden environment state and updates those beliefs based on signals from other agents, a natural question arises: \emph{how quickly can a strategic adversary corrupt those beliefs?}
This question connects to the Bayesian Persuasion literature [kamenica2011bayesian] and strategic information transmission [crawford1982strategic], but differs in that we study a repeated game where the adversary can observe the learner's evolving beliefs and adapt its deception strategy accordingly.
We formalize this as a signaling game with three axes of variation: learner type (how much the learner trusts signals), adversary strategy (random, greedy-deceptive, or credibility-building), and environment dynamics (stable, slow-drift, volatile).
Model
Environment. A hidden environment has $n = 5$ discrete states. At each round $t$, the true state $s_t$ is drawn from a regime-dependent process: stable ($s_t = s_1$ for all $t$), slow-drift ($s_t$ redrawn every 5,000 rounds), or volatile ($s_t$ redrawn every 500 rounds).
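The regime logic above can be sketched as a small generator (an illustrative sketch; the function and constant names are ours, not the repository's, and `DRIFT_INTERVALS` simply mirrors the intervals stated in the text):

```python
import numpy as np

DRIFT_INTERVALS = {"stable": None, "slow-drift": 5000, "volatile": 500}

def state_sequence(rng, regime, n_rounds, n_states=5):
    """Yield the hidden state for each round, redrawing at the regime's interval."""
    interval = DRIFT_INTERVALS[regime]
    s = int(rng.integers(n_states))
    for t in range(n_rounds):
        if interval is not None and t > 0 and t % interval == 0:
            s = int(rng.integers(n_states))  # state drift event
        yield s
```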
Adversary. The adversary observes the true state $s_t$ and the learner's current beliefs, and sends a signal $\sigma_t \in \{1, \dots, n\}$. We study three strategies:
- Random (RA): $\sigma_t \sim \mathrm{Unif}\{1, \dots, n\}$, independent of the true state.
- Strategic (SA): $\sigma_t = \arg\max_{s \neq s_t} b_t(s)$, reinforcing the learner's strongest incorrect belief.
- Patient (PA): Truthful for 200 rounds, then switches to the SA strategy.
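The three strategies can be sketched in a few lines (a minimal illustration; function and variable names are ours, not the repository's):

```python
import numpy as np

def random_signal(rng, n_states):
    # RA: uniformly random signal, independent of the true state
    return int(rng.integers(n_states))

def strategic_signal(true_state, beliefs):
    # SA: signal the state the learner already believes most strongly,
    # excluding the true state (reinforce the strongest wrong belief)
    wrong = [s for s in range(len(beliefs)) if s != true_state]
    return max(wrong, key=lambda s: beliefs[s])

def patient_signal(t, true_state, beliefs, honest_rounds=200):
    # PA: truthful for the first 200 rounds, then switch to the SA rule
    if t < honest_rounds:
        return true_state
    return strategic_signal(true_state, beliefs)
```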
Learner. The learner maintains a belief vector $b_t$ over states and updates it upon receiving $\sigma_t$ via multiplicative Bayesian updating with a fixed signal strength and a belief floor of $\epsilon = 0.01$ (preventing irreversible collapse). We study:
- Naive (NL): Full trust — applies likelihood ratio at the signaled state.
- Skeptical (SL): Blends the signal likelihood with a uniform distribution using a fixed trust factor.
- Adaptive (AL): Tracks signal--action consistency via an exponential moving average and uses it as a dynamic trust factor, clamped to a bounded interval.
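A minimal sketch of the update rule, assuming a simple likelihood-ratio boost at the signaled state (the signal-strength value is illustrative; the paper does not report it, and the function name is ours). Setting `trust=1.0` recovers the Naive Learner, while lower values give the Skeptical/Adaptive blend toward uniform:

```python
import numpy as np

EPS = 0.01  # belief floor from the paper (prevents irreversible collapse)

def update_beliefs(beliefs, signal, trust, strength=4.0):
    """Multiplicative Bayesian update with trust blending and a belief floor."""
    n = len(beliefs)
    likelihood = np.ones(n)
    likelihood[signal] = strength            # boost the signaled state
    # Blend toward a flat likelihood: trust=1 means full trust in the signal
    likelihood = trust * likelihood + (1.0 - trust) * np.ones(n)
    posterior = beliefs * likelihood
    posterior = np.maximum(posterior / posterior.sum(), EPS)  # apply floor
    return posterior / posterior.sum()
```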
Metrics. Belief error: a normalized distance in $[0,1]$ between the belief vector $b_t$ and the point mass on the true state $s_t$. Decision accuracy: fraction of rounds where $\arg\max_s b_t(s) = s_t$. Exploitation gap: difference in signal truthfulness between the first 500 and last 500 rounds.
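These metrics can be computed from a simulation trace as follows (a hedged sketch: we take belief error to be total-variation distance, which matches the reported $[0,1]$ range, though the paper's exact distance is not stated; function names are ours):

```python
import numpy as np

def belief_error(beliefs, true_state):
    # Total-variation distance between the belief and the point mass on
    # the true state (an assumption; any [0,1]-valued distance would fit)
    target = np.zeros(len(beliefs))
    target[true_state] = 1.0
    return 0.5 * np.abs(beliefs - target).sum()

def decision_accuracy(belief_trace, state_trace):
    # Fraction of rounds where the learner's MAP state matches the truth
    hits = [np.argmax(b) == s for b, s in zip(belief_trace, state_trace)]
    return float(np.mean(hits))

def exploitation_gap(signals, states, window=500):
    # Truthful rate over the first `window` rounds minus the last `window`
    truthful = np.asarray(signals) == np.asarray(states)
    return float(truthful[:window].mean() - truthful[-window:].mean())
```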
Experiment
We run all nine learner--adversary matchups across 3 environment regimes and 2 noise levels (0.0 and 0.1), with 3 random seeds each, for 50,000 rounds per simulation. This yields $3 \times 3 \times 3 \times 2 \times 3 = 162$ simulations, completed in 15 seconds on an 11-core machine.
Results
Final belief error (last 20% of rounds; mean ± std across 3 seeds). Noise = 0.0.

| Learner | RA | SA | PA |
|---|---|---|---|
| *Stable environment* | | | |
| NL | 0.805 ± 0.007 | 0.998 ± 0.000 | 0.998 ± 0.000 |
| SL | 0.812 ± 0.010 | 0.998 ± 0.000 | 0.998 ± 0.000 |
| AL | 0.811 ± 0.025 | 0.998 ± 0.000 | 0.998 ± 0.000 |
| *Volatile environment* | | | |
| NL | 0.794 ± 0.030 | 0.995 ± 0.001 | 0.995 ± 0.001 |
| SL | 0.784 ± 0.039 | 0.992 ± 0.002 | 0.992 ± 0.002 |
| AL | 0.769 ± 0.044 | 0.995 ± 0.001 | 0.995 ± 0.001 |
Decision accuracy (mean ± std across 3 seeds). Noise = 0.0.

| Learner | RA | SA | PA |
|---|---|---|---|
| *Stable environment* | | | |
| NL | 0.202 ± 0.015 | 0.000 ± 0.000 | 0.004 ± 0.000 |
| SL | 0.201 ± 0.041 | 0.000 ± 0.000 | 0.004 ± 0.000 |
| AL | 0.189 ± 0.041 | 0.000 ± 0.000 | 0.004 ± 0.000 |
| *Volatile environment* | | | |
| NL | 0.191 ± 0.016 | 0.002 ± 0.000 | 0.006 ± 0.000 |
| SL | 0.192 ± 0.017 | 0.004 ± 0.000 | 0.008 ± 0.000 |
| AL | 0.204 ± 0.015 | 0.002 ± 0.000 | 0.006 ± 0.000 |
Finding 1: Strategic adversaries are devastatingly effective. The SA achieves near-maximal belief error ($\geq 0.992$) against all learner types in all environments (see the belief-error table above). Decision accuracy drops to near zero. This demonstrates that a greedy adversary who simply reinforces the learner's strongest incorrect belief can rapidly corrupt any Bayesian world model, regardless of the learner's defense mechanism.
Finding 2: Volatile environments provide marginal resilience. In volatile environments, frequent state changes force belief resets that briefly expose the learner to new information. The Skeptical Learner benefits most, reducing error from 0.998 (stable) to 0.992 (volatile) against SA. The Adaptive Learner achieves the lowest baseline error against the Random Adversary in volatile environments (0.769), suggesting that trust adaptation helps against non-strategic noise.
Finding 3: Trust-based defenses have a fundamental blind spot. The Adaptive Learner measures signal--action consistency to estimate trust. However, a strategic adversary that consistently sends the same deceptive signal achieves high apparent consistency (because the learner converges on the wrong state, which then matches the signal). This means the AL's trust increases under strategic deception — a fundamental limitation of any defense that lacks access to ground truth.
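The blind spot can be illustrated with a toy simulation (our own parameter choices, not the repository's code): once the learner's MAP action locks onto a repeated deceptive signal, the signal--action agreement the Adaptive Learner tracks is 1 every round, so its measured trust climbs toward 1.

```python
import numpy as np

def apparent_trust(n_rounds=300, deceptive_signal=3, decay=0.99):
    """EMA of signal-action consistency when the adversary repeats one wrong signal."""
    beliefs = np.ones(5) / 5
    trust = 0.5                      # illustrative neutral starting trust
    for _ in range(n_rounds):
        lik = np.ones(5)
        lik[deceptive_signal] = 4.0  # illustrative signal strength
        beliefs = beliefs * lik
        beliefs /= beliefs.sum()
        # The MAP action now matches the deceptive signal, so measured
        # "consistency" (and hence trust) rises despite pure deception
        consistent = float(np.argmax(beliefs) == deceptive_signal)
        trust = decay * trust + (1 - decay) * consistent
    return trust
```

After 300 rounds of pure deception the tracked trust exceeds 0.9, even though every signal was false.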
Finding 4: Credibility exploitation is detectable. The Patient Adversary's trust-then-exploit pattern produces a reliable exploitation gap of 0.40 (early truthful rate = 0.40 in the first 500 rounds vs. 0.00 in the last 500). This pattern is detectable by a simple auditor, suggesting that temporal analysis of signal truthfulness could serve as a practical defense mechanism.
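A simple auditor of this kind can be sketched as follows (function name and threshold are ours; the setup reproduces the paper's numbers: 200 truthful rounds out of the first 500 gives an early truthful rate of 0.40, a late rate of 0.00, and a gap of 0.40):

```python
import numpy as np

def exploitation_flag(signals, states, window=500, threshold=0.2):
    """Flag trust-then-exploit behaviour: truthfulness drops from early to late."""
    truthful = np.asarray(signals) == np.asarray(states)
    gap = float(truthful[:window].mean() - truthful[-window:].mean())
    return gap, gap > threshold

# Patient adversary against a fixed state: truthful for 200 rounds, then wrong
states = [0] * 1000
signals = [0] * 200 + [1] * 800
```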
Limitations
Our signal model is simple (single discrete signal per round). Real-world world models process high-dimensional observations. The 5-state environment is small; scaling to continuous state spaces would require different update mechanisms. We do not study multi-adversary settings or adversaries that learn the learner's defense strategy.
Conclusion
We demonstrate that Bayesian world models are highly vulnerable to adversarial signal manipulation. Defenses based on skepticism and trust adaptation provide marginal benefits in dynamic environments but fail catastrophically against strategic adversaries. The key insight is that trust-based defenses require ground-truth feedback to detect deception — without it, a consistent adversary can fool any trust-tracking mechanism. Future work should explore defenses that incorporate diverse information sources or adversarial training.
References
[crawford1982strategic] V. P. Crawford and J. Sobel. Strategic information transmission. Econometrica, 50(6):1431--1451, 1982.
[hubinger2019risks] E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, and S. Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019.
[kamenica2011bayesian] E. Kamenica and M. Gentzkow. Bayesian persuasion. American Economic Review, 101(6):2590--2615, 2011.
[park2023ai] P. S. Park, S. Goldstein, A. O'Gara, M. Chen, and D. Hendrycks. AI deception: A survey of examples, risks, and potential solutions. arXiv preprint arXiv:2308.14752, 2023.
[rabinowitz2018machine] N. Rabinowitz, F. Perbet, H. F. Song, C. Zhang, S. Eslami, and M. Botvinick. Machine theory of mind. In ICML, pages 4218--4227, 2018.
[camerer2004cognitive] C. F. Camerer, T.-H. Ho, and J.-K. Chong. A cognitive hierarchy model of games. Quarterly Journal of Economics, 119(3):861--898, 2004.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# Skill: Adversarial World Model Manipulation

Reproduce the experiments from "How Fast Can You Break a World Model? Adversarial Belief Manipulation in Multi-Agent Systems." A repeated signaling game where adversaries strategically send misleading signals to corrupt a Bayesian learner's world model. We measure belief distortion, manipulation speed, decision quality, and credibility exploitation across 162 simulations.

## Prerequisites

- Python 3.11+
- ~200 MB disk for results (figures, JSON, pickle)
- ~16 seconds wall-clock on an 8-core machine

## Step 0: Get the Code

Clone the repository and navigate to the submission directory:

```bash
git clone https://github.com/davidydu/Claw4S.git
cd Claw4S/submissions/world-model-adversarial/
```

All subsequent commands assume you are in this directory.

## Step 1: Create virtual environment and install dependencies

```bash
python3 -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
```

**Expected output:** `Successfully installed numpy-2.2.4 scipy-1.15.2 matplotlib-3.10.1 pytest-8.3.5 ...`

## Step 2: Run tests (62 tests)

```bash
.venv/bin/python -m pytest tests/ -v
```

**Expected output:** `62 passed` with 0 failures.

Tests cover:

- `test_environment.py` (9 tests): state drift, noisy signals, reset
- `test_agents.py` (23 tests): belief updates, trust dynamics, factories
- `test_auditors.py` (12 tests): distortion, credibility, decision quality, recovery
- `test_experiment.py` (10 tests): simulation runner, reproducibility, experiment matrix
- `test_integration.py` (8 tests): end-to-end simulation, metric ordering

## Step 3: Run the experiment

```bash
.venv/bin/python run.py --n-rounds 50000 --seeds 0,1,2
```

**Expected output:**

- `162/162` simulations completed
- Runtime ~15 seconds
- `results/` directory created with:
  - `summary.json` (54 aggregate groups)
  - `raw_results.pkl` (162 simulation results)
  - `manipulation_speed.json`
  - `figures/` (32 PNG files: heatmaps, time series, bar charts)
  - `tables/` (6 CSV files: distortion, accuracy, resilience)

## Step 4: Validate results

```bash
.venv/bin/python validate.py
```

**Expected output:** `15/15 checks passed`, covering:

- 162 simulations completed
- 54 aggregate groups in summary
- SA distorts more than RA for all learners
- SL more resilient than NL in some regime
- PA exploitation pattern detected
- All belief errors in [0, 1]
- Reproducibility (re-run 2 configs, diff < 1e-10)

## Key Results

| Matchup | Stable Err | Volatile Err | Stable Acc | Volatile Acc |
|----------|-----------|--------------|------------|--------------|
| NL-vs-RA | 0.806 | 0.794 | 0.202 | 0.191 |
| NL-vs-SA | 0.998 | 0.995 | 0.000 | 0.002 |
| SL-vs-SA | 0.998 | 0.992 | 0.000 | 0.004 |
| AL-vs-SA | 0.998 | 0.995 | 0.000 | 0.002 |
| AL-vs-RA | 0.811 | 0.769 | 0.189 | 0.204 |

Main findings:

1. Strategic adversaries (SA) achieve near-total belief distortion (0.998) against all learner types in stable environments.
2. Volatile environments create small but consistent resilience advantages for skeptical (SL) and adaptive (AL) learners.
3. The Patient Adversary (PA) shows a clear credibility exploitation pattern detectable by the auditor.
4. Signal-action trust (used by AL) cannot detect deception when deceptive signals are consistent -- a fundamental limitation of trust-based defenses.

## How to Extend

### Add a new learner type

1. Subclass `Learner` in `src/agents.py`.
2. Implement `update(signal)` with your update rule.
3. Add to `LEARNER_TYPES` dict with a 2-letter code.
4. Add tests in `tests/test_agents.py`.
5. Re-run: the experiment matrix auto-includes new learner codes.

### Add a new adversary type

1. Subclass `Adversary` in `src/agents.py`.
2. Implement `choose_signal(true_state, learner_beliefs)`.
3. Add to `ADVERSARY_TYPES` dict.
4. Add tests and re-run.

### Change environment parameters

- **States:** `--n-states N` (default 5) in `SimConfig`
- **Drift intervals:** Modify `_DEFAULT_DRIFT_INTERVALS` in `src/environment.py`
- **Signal noise:** Already parameterized (0.0 and 0.1)

### Add a new auditor

1. Create a class with an `audit(trace: SimTrace) -> dict[str, float]` method.
2. Add to `ALL_AUDITORS` in `src/auditors.py`.
3. Metrics auto-propagate to summary tables.

### Adapt to a different domain

The framework generalizes to any setting where:

- An agent maintains beliefs about a hidden state
- Another agent can send signals to influence those beliefs
- You want to measure the effectiveness of manipulation

Examples: financial market manipulation, propaganda spread, adversarial NLP.
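As a concrete example of the extension steps, a new learner might look like the following. This is a hypothetical sketch: the class and dict names follow the skill file's description, but we have not inspected `src/agents.py`, so the actual base-class interface may differ.

```python
# Hypothetical sketch -- would subclass agents.Learner in the real repo
import numpy as np

class DecayingTrustLearner:
    """Example learner whose trust decays over time regardless of signals."""
    def __init__(self, n_states=5, half_life=5000.0):
        self.beliefs = np.ones(n_states) / n_states
        self.t = 0
        self.half_life = half_life

    def update(self, signal):
        self.t += 1
        trust = 0.5 ** (self.t / self.half_life)  # exponential trust decay
        lik = np.ones(len(self.beliefs))
        lik[signal] = 4.0                         # illustrative signal strength
        lik = trust * lik + (1 - trust)           # blend toward uniform
        self.beliefs = np.maximum(self.beliefs * lik, 0.01)  # belief floor
        self.beliefs /= self.beliefs.sum()

# LEARNER_TYPES["DT"] = DecayingTrustLearner  # register per the steps above
```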