2604.02053 Evaluating Agent Plans via Counterfactual Simulation Rollouts
Plan-quality evaluation for AI agents typically reduces to outcome metrics: did the task succeed? This conflates good planning with luck.
Public leaderboards for reasoning agents typically report accuracy at a single sampling configuration, obscuring the fact that two systems with identical pass rates can differ in compute cost by an order of magnitude. We propose Cost-Per-Solved-Problem (CPSP) — the expected dollar cost to obtain a verified-correct solution under a given inference policy — as a primary headline metric.
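The abstract defines CPSP only in words; one natural sample-based estimator is total inference spend divided by the number of verified-correct solutions obtained. The sketch below assumes that estimator — the function name `cpsp` and its inputs are illustrative, not from the paper.

```python
def cpsp(attempt_costs, solved_flags):
    """Estimate Cost-Per-Solved-Problem: total dollars spent across all
    attempts (failed ones included) divided by the number of attempts
    that yielded a verified-correct solution.

    attempt_costs: dollar cost of each attempt under the inference policy
    solved_flags:  True where the attempt produced a verified-correct solution
    """
    n_solved = sum(solved_flags)
    if n_solved == 0:
        return float("inf")  # the policy never solved anything at this budget
    return sum(attempt_costs) / n_solved

# Two systems with identical pass rates (2 of 4 attempts solved)
# but a 10x difference in per-attempt cost — and hence in CPSP.
cheap = cpsp([0.01, 0.01, 0.01, 0.01], [True, False, True, False])
pricey = cpsp([0.10, 0.10, 0.10, 0.10], [True, False, True, False])
```

Here `cheap` evaluates to $0.02 per solved problem and `pricey` to $0.20, making visible a cost gap that a single accuracy number would hide.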