Evaluating Agent Plans via Counterfactual Simulation Rollouts
1. Introduction
When an agent succeeds on a task, was its plan good or did it get lucky? When it fails, was the plan bad or was the environment adversarial? Outcome-only evaluation cannot tell. We propose CFSim, an evaluator that uses counterfactual rollouts in a learned environment model to attribute outcome to plan quality.
2. Related Work
Trace-likelihood scoring [Hong et al. 2024] proxies plan quality by the probability of the produced trace under a reference policy; it correlates with human ratings but conflates style with substance. Reward-model approaches [Bai et al. 2022] judge final outputs but not the planning step. Counterfactual reasoning has been used in offline RL evaluation [Thomas and Brunskill 2016] but not, to our knowledge, applied directly to agent plan scoring at this granularity.
3. Method
3.1 Setup
Let a plan $\pi = (a_1, \ldots, a_T)$ be a sequence of actions over an action space $\mathcal{A}$. Let $\hat{E}$ be a calibrated environment model that, given state $s_t$ and action $a_t$, returns a distribution over the next state and observation. Define the simulated return of a plan as

$$R(\pi) = \mathbb{E}_{\hat{E}}\!\left[ \sum_{t} r(s_t, a_t) \right].$$
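For concreteness, here is a minimal sketch of this simulated-return estimator. The environment-model interface (`initial_state()` and `step(state, action)` returning a sampled next state and reward) and the Monte Carlo averaging over `n_rollouts` are our assumptions; the paper only specifies the expectation under $\hat{E}$.

```python
# Minimal sketch of the simulated return R(pi) under a learned environment
# model. The `env_model` interface shown here is an assumption.
from statistics import mean

def simulate(env_model, plan, n_rollouts=8):
    """Monte Carlo estimate of E_{E_hat}[ sum_t r(s_t, a_t) ] for a fixed plan."""
    returns = []
    for _ in range(n_rollouts):
        state = env_model.initial_state()
        total = 0.0
        for action in plan:
            # Sample a transition from the learned model and accrue reward.
            state, reward = env_model.step(state, action)
            total += reward
        returns.append(total)
    return mean(returns)
```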
3.2 Counterfactual perturbations
We construct a candidate set of perturbed plans via three edit operators (a code sketch follows the list):
- single-action substitutions (drawn from a domain-specific edit distribution),
- prefix truncations,
- branch swaps (swapping the order of two independent sub-goals).
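A minimal sketch of these operators, assuming a plan is a list of at least two actions. The helpers `edit_distribution(action)` (the domain-specific edit distribution) and `independent_segments(plan)` (index ranges of mutually independent sub-goals, at least two of them) are hypothetical names, not part of the paper.

```python
# Sketch of the three perturbation operators from Section 3.2.
# `edit_distribution` and `independent_segments` are hypothetical helpers.
import random

def perturb(plan):
    """Return one perturbed copy of `plan` via a randomly chosen edit operator."""
    plan = list(plan)
    op = random.choice(["substitute", "truncate", "swap"])
    if op == "substitute":
        # Single-action substitution drawn from the domain-specific edit distribution.
        i = random.randrange(len(plan))
        plan[i] = edit_distribution(plan[i])
    elif op == "truncate":
        # Prefix truncation: keep only a strict prefix of the plan.
        plan = plan[: random.randrange(1, len(plan))]
    else:
        # Branch swap: exchange two sub-plans that realize independent sub-goals.
        # Assumes `independent_segments` returns at least two (start, end) pairs.
        (a0, a1), (b0, b1) = sorted(random.sample(independent_segments(plan), 2))
        plan = plan[:a0] + plan[b0:b1] + plan[a1:b0] + plan[a0:a1] + plan[b1:]
    return plan
```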
3.3 The CFSim score
With $R$ as above and $C$ the candidate set from Section 3.2, define

$$\mathrm{CFSim}(\pi) = \frac{R(\pi) - \min_{\pi' \in C \cup \{\pi\}} R(\pi')}{\max_{\pi' \in C \cup \{\pi\}} R(\pi') - \min_{\pi' \in C \cup \{\pi\}} R(\pi')}.$$

A score near 1 means $\pi$ is near the simulated optimum within its perturbation neighborhood; a score near 0 means it is among the worst.
```python
def cfsim(plan, env_model, K=24):
    """Score `plan` by its simulated return relative to K perturbed variants."""
    base = simulate(env_model, plan)
    cands = [simulate(env_model, perturb(plan)) for _ in range(K)]
    # Include the base return in the normalization so the score lies in [0, 1].
    lo, hi = min(cands + [base]), max(cands + [base])
    return (base - lo) / max(hi - lo, 1e-6)
```

4. Experimental Setup
Tasks. 18 web-navigation tasks (booking flows, form filling) and 22 office-tool tasks (multi-step spreadsheet edits, email triage). Each task has a recorded gold trajectory and three plausible non-gold trajectories, each scored by three human raters.
Environment models. Web tasks use a sandboxed, Playwright-driven simulator; office tasks use a deterministic mock of an office-suite API. Both were calibrated against held-out real executions, with step-level disagreement within 4.1%.
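A minimal sketch of how such a step-level agreement check might be computed: logged actions are replayed through the simulator and divergent steps counted. The trace format (pairs of logged action and real observation) and the model's `step_obs` method are our assumptions.

```python
# Sketch of the step-level calibration check behind the 4.1% figure.
# The trace record format and `step_obs` interface are assumptions.

def step_level_disagreement(env_model, real_traces):
    """Fraction of replayed steps whose simulated observation differs from reality."""
    total = mismatches = 0
    for trace in real_traces:
        state = env_model.initial_state()
        for action, real_obs in trace:
            # Replay the logged action through the simulator.
            state, sim_obs = env_model.step_obs(state, action)
            mismatches += sim_obs != real_obs
            total += 1
    return mismatches / total  # the paper reports this within 4.1%
```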
5. Results
| Evaluator | Spearman with human raters |
|---|---|
| Outcome only | 0.41 |
| Trace-likelihood | 0.59 |
| CFSim (K = …) | 0.78 |
| CFSim (K = …) | 0.81 |
Differences between CFSim and trace-likelihood were statistically significant under a stratified bootstrap test.
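One possible instantiation of such a test is sketched below: resample tasks with replacement within each stratum, recompute both evaluators' Spearman correlations per replicate, and report the fraction of replicates where CFSim fails to beat the baseline. The strata (e.g., web vs. office tasks), the resampling unit, and the decision rule are assumptions about the paper's procedure.

```python
# Sketch of a stratified bootstrap comparing two evaluators' Spearman
# correlations with human ratings. Strata and decision rule are assumptions.
import random
from scipy.stats import spearmanr

def stratified_bootstrap_p(scores_a, scores_b, human, strata, n_boot=10_000):
    """Fraction of replicates in which evaluator A fails to beat evaluator B."""
    groups = {}
    for i, s in enumerate(strata):
        groups.setdefault(s, []).append(i)
    losses = 0
    for _ in range(n_boot):
        # Resample indices with replacement within each stratum, keeping sizes fixed.
        idx = [random.choice(ids) for ids in groups.values() for _ in ids]
        rho_a = spearmanr([scores_a[i] for i in idx], [human[i] for i in idx]).correlation
        rho_b = spearmanr([scores_b[i] for i in idx], [human[i] for i in idx]).correlation
        losses += rho_a <= rho_b
    return losses / n_boot
```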
6. Analysis
Where CFSim shines. Tasks with multiple acceptable solutions (e.g., booking flows with both deep-link and search-driven paths). Outcome-only evaluation credits both paths equally; CFSim correctly identifies the more robust one.
Where CFSim suffers. Tasks where the environment model is mis-specified (e.g., a website behavior the simulator does not capture). Sensitivity to calibration error is the main cost of the method.
7. Limitations
The perturbation distribution is task-specific. We provide defaults, but practitioners should expect to tune them. We also do not address the cost of producing a plan, only its quality.
8. Conclusion
Counterfactual simulation gives plan-quality evaluators a useful third dimension beyond outcome and likelihood. The main cost is environment-model calibration; for tasks where a calibrated model is available, CFSim correlates substantially better with expert judgment than existing methods.
References
- Hong, S. et al. (2024). Trace-Likelihood Scoring for Agents.
- Bai, Y. et al. (2022). Constitutional AI.
- Thomas, P. S. and Brunskill, E. (2016). Data-Efficient Off-Policy Policy Evaluation.
- Zheng, L. et al. (2024). AgentBench.