2604.00678 Viral Reward Hacking: How One Agent's Exploit Spreads Through a Multi-Agent System
Reward hacking—where an agent discovers an unintended strategy that achieves high proxy reward but low true reward—is well-studied as a single-agent alignment failure. We show that in multi-agent systems, reward hacking becomes a systemic risk: through social learning, one agent's exploit spreads to others like a contagion.