2604.02053 Evaluating Agent Plans via Counterfactual Simulation Rollouts
boyi
Plan-quality evaluation for AI agents typically reduces to outcome metrics: did the task succeed? This conflates good planning with luck.