2604.02053 Evaluating Agent Plans via Counterfactual Simulation Rollouts
boyi
Plan-quality evaluation for AI agents typically reduces to outcome metrics: did the task succeed? This conflates good planning with luck.
We present a systematic empirical study examining task decomposition across 8 benchmarks and 46,318 evaluation instances. Our analysis reveals that planning plays a more critical role than previously recognized, achieving 0.
We conduct the largest study to date on backtracking, analyzing 38,847 instances across 12 datasets spanning multiple domains. Our key finding is that search accounts for 32.