2604.02053 Evaluating Agent Plans via Counterfactual Simulation Rollouts
Plan-quality evaluation for AI agents typically reduces to outcome metrics: did the task succeed? This conflates good planning with luck.
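As a minimal sketch of the idea (our construction; the abstract specifies no interface, so every name here is hypothetical), plan quality can be decoupled from luck by replaying the same plan under many counterfactually re-seeded simulations and scoring the resulting success distribution rather than the single observed outcome:

```python
import random
from typing import Callable

def rollout_success_rate(
    plan: Callable[[random.Random], bool],  # plan(rng) -> did the task succeed?
    n_rollouts: int = 100,
) -> float:
    """Score a plan by its expected success over counterfactual rollouts."""
    successes = sum(plan(random.Random(seed)) for seed in range(n_rollouts))
    return successes / n_rollouts

def flaky_plan(rng: random.Random) -> bool:
    return rng.random() < 0.2  # succeeds only ~20% of the time

# A single lucky rollout would score this plan 1.0; the counterfactual
# estimate instead exposes its true ~0.2 reliability.
print(rollout_success_rate(flaky_plan))
```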
Operators of deployed AI agents accumulate large quantities of observational logs — system prompts, tool calls, user feedback signals — and frequently want to estimate causal effects from these logs (e.g.
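One standard technique for this setting (not necessarily what this work proposes): inverse propensity scoring reweights logged outcomes to estimate what a different action policy would have achieved. A minimal synthetic sketch, with the log schema entirely hypothetical:

```python
import numpy as np

# Synthetic stand-in for an agent's observational logs: `action` is which
# prompt/tool variant the logging policy chose, `reward` is an observed
# feedback signal, `propensity` is the logged P(action = 1).
rng = np.random.default_rng(0)
n, propensity = 10_000, 0.3
action = rng.binomial(1, propensity, size=n)
reward = rng.binomial(1, np.where(action == 1, 0.6, 0.4))  # variant 1 is truly better

# Inverse propensity scoring for the target policy "always use variant 1":
# reweighting corrects for the logging policy choosing it only 30% of the time.
ips_value = np.mean(reward * action / propensity)
print(ips_value)  # close to 0.6, the true value of always choosing variant 1
```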
Agent-based peer review is a foundational premise of executable science: if skills replace papers, agents must replace reviewers. But how reliably do agents detect *methodological* errors: flaws that run cleanly, produce plausible output, and silently invalidate conclusions?
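A canonical instance of such a silent flaw (our illustration, not an example drawn from the paper): selecting features on the full dataset before the train/test split leaks label information, yet the script raises no error and reports a plausible score on pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # pure noise features
y = rng.integers(0, 2, size=200)   # labels independent of X

# FLAW: feature selection sees every label, including the test labels,
# before the split. Nothing crashes and the output looks plausible.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)
acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")  # well above chance, yet the data is noise
```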
We propose a simple clarification principle for coding agents: ask only when the current evidence supports multiple semantically distinct action modes and further autonomous repository exploration no longer reduces that bifurcation. This yields a compact decision object, *action bifurcation*, that is cleaner than model-uncertainty thresholds, memory ontologies, assumption taxonomies, or end-to-end ask/search/act reinforcement learning.
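A minimal sketch of the ask/act gate as we read it (the `Candidate` and `explore` interfaces are hypothetical, not from the paper): keep exploring while exploration still collapses candidate action modes, act once a single mode remains, and ask only when a genuine bifurcation survives exploration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Candidate:
    """One plausible reading of the user's request."""
    action_mode: str  # e.g. "patch_existing_function" vs. "add_new_module"

def should_ask(
    candidates: list[Candidate],
    explore: Callable[[list[Candidate]], list[Candidate]],  # one exploration step
    max_steps: int = 5,
) -> bool:
    """Ask only if semantically distinct action modes survive exploration."""
    for _ in range(max_steps):
        modes = {c.action_mode for c in candidates}
        if len(modes) <= 1:
            return False  # no bifurcation: act autonomously
        refined = explore(candidates)
        if {c.action_mode for c in refined} == modes:
            return True   # exploration no longer reduces the bifurcation: ask
        candidates = refined
    return True  # exploration budget exhausted with the bifurcation intact
```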