Causal Identifiability Under Hidden Confounders in Observational Agent Logs
1. Introduction
A tempting use of agent logs is to ask counterfactual questions: would my system perform better if I added a verifier sub-agent? Without randomized assignment, this is a causal inference problem on observational data. Hidden confounders — features of tasks that affect both routing and outcome but are not logged — threaten naive estimates.
We analyze identifiability in this setting, derive conditions under which the effect can still be recovered, and provide a practical sensitivity analysis for the unidentifiable case.
2. Setup
Let T denote treatment (e.g., verifier present), Y the outcome (task success), X the logged covariates (prompt length, tool count, user tier, etc.), and U the unlogged covariates (e.g., true task difficulty as judged by a human).
The causal DAG of interest is

    X → T, X → Y, U → T, U → Y, T → Y,

with U unobserved. The target estimand is

    τ = E[Y | do(T = 1)] − E[Y | do(T = 0)],

the average effect of adding the verifier under intervention.
3. Identifiability
Under strong ignorability — (Y(1), Y(0)) ⊥ T | X — the ATE is identified by the standard adjustment formula. With hidden U, ignorability fails and the ATE is in general not identified. We study two relaxations.
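To make the identified case concrete, here is a minimal synthetic sketch of the adjustment formula with a single binary logged covariate (the data-generating process is invented for illustration): stratifying on X removes the bias that the naive difference in means carries.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic logs (illustrative only): X confounds routing T and outcome Y.
X = rng.binomial(1, 0.5, n)
T = rng.binomial(1, 0.2 + 0.6 * X)            # routing depends on X
Y = rng.binomial(1, 0.3 + 0.2 * T + 0.3 * X)  # true effect of T is 0.2

# Naive difference in means is confounded by X.
naive = Y[T == 1].mean() - Y[T == 0].mean()

# Adjustment formula: stratum-wise contrasts averaged over P(X = x).
ate = sum(
    (Y[(T == 1) & (X == x)].mean() - Y[(T == 0) & (X == x)].mean()) * (X == x).mean()
    for x in (0, 1)
)
print(f"naive={naive:.3f}  adjusted={ate:.3f}")
```

With this generating process the adjusted estimate recovers the true effect of 0.2 while the naive contrast is substantially inflated.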
Proxy availability. If we have a logged variable W that is a proxy for U (i.e., U → W, and W is independent of (T, Y) given U), then under proxy-completeness conditions [Miao et al. 2018] the ATE can be recovered via a tensor-decomposition estimator.
Instrumental variable. If we have a variable Z such that Z → T but Z affects Y only through T, and Z ⊥ U, then the ATE is identified by IV regression. Random A/B test events occasionally provide such instruments in production systems.
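A minimal Wald/IV sketch on synthetic data (invented for illustration), with a randomized rollout flag Z as the instrument and a hidden difficulty U that the naive contrast cannot remove:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

U = rng.normal(size=n)                 # hidden confounder (true difficulty)
Z = rng.binomial(1, 0.5, n)            # random rollout flag: the instrument
T = rng.binomial(1, 1 / (1 + np.exp(-(2.0 * Z + U))))  # routing: Z and U
Y = 0.2 * T - 0.5 * U + rng.normal(size=n)             # true effect is 0.2

naive = Y[T == 1].mean() - Y[T == 0].mean()  # badly biased by U

# Wald estimator: reduced-form effect divided by first-stage effect.
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (T[Z == 1].mean() - T[Z == 0].mean())
print(f"naive={naive:.3f}  iv={wald:.3f}")
```

Because Z is independent of U, the ratio recovers the true effect (here a constant 0.2), while the naive contrast is dragged well away from it by the hidden difficulty.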
4. Sensitivity Analysis When Identifiability Fails
When neither proxies nor instruments are available, we propose a Rosenbaum-style sensitivity bound. Define the worst-case bias parameter Γ ≥ 1, the odds ratio for treatment assignment between high- and low-confounder strata:

    1/Γ ≤ [π_i (1 − π_j)] / [π_j (1 − π_i)] ≤ Γ

for any two units i, j with identical logged covariates, where π_i = P(T_i = 1). For each plausible Γ we compute a bias-adjusted confidence interval. The smallest Γ at which the interval crosses zero is the robustness value: practitioners can compare it to plausible levels of unmeasured confounding informed by domain knowledge.
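The Γ grid search itself is not reproduced here, but a closely related closed-form robustness summary is the E-value of VanderWeele & Ding (2017), cited in the references. A sketch (the risk ratio `rr` is a hypothetical input, not a number from this paper):

```python
import math

def e_value(rr: float) -> float:
    """E-value: the minimum strength of association (on the risk-ratio scale)
    that an unmeasured confounder would need with both treatment and outcome
    to fully explain away an observed risk ratio rr."""
    if rr < 1:          # protective effects are handled symmetrically
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))
```

For example, `e_value(1.5)` is about 2.37: explaining away a 1.5x observed risk ratio requires a confounder associated with both treatment and outcome by risk ratios of roughly 2.37 each.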
5. Estimator
We use a doubly robust estimator that adjusts for X and is augmented with proxy correction when available:

    τ̂ = (1/n) Σ_i [ m̂(X_i, 1) − m̂(X_i, 0) + T_i (Y_i − m̂(X_i, 1)) / ê(X_i) − (1 − T_i) (Y_i − m̂(X_i, 0)) / (1 − ê(X_i)) ].

Nuisance functions m̂(x, t) ≈ E[Y | X = x, T = t] and ê(x) ≈ P(T = 1 | X = x) are fit with cross-fitting [Chernozhukov et al. 2018] using gradient-boosted trees.
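A self-contained sketch of the cross-fitting scheme on synthetic logs: two folds, with per-stratum sample means standing in for the gradient-boosted nuisance learners (the data-generating process is invented for illustration). The per-unit contributions mirror those inside dr_ate.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
X = rng.integers(0, 4, n)                     # discrete logged complexity score
T = rng.binomial(1, 0.2 + 0.15 * X)           # routing favors complex tasks
Y = rng.binomial(1, 0.2 + 0.1 * T + 0.1 * X)  # true ATE = 0.1

K = 2
folds = rng.permutation(n) % K
psi = np.empty(n)
for k in range(K):
    tr, te = folds != k, folds == k
    # Nuisances fit on the training fold only; per-stratum means stand in
    # for gradient-boosted learners in this toy discrete setting.
    m = {(x, t): Y[tr & (X == x) & (T == t)].mean()
         for x in range(4) for t in (0, 1)}
    e = {x: T[tr & (X == x)].mean() for x in range(4)}
    # Out-of-fold predictions on the held-out fold.
    m1 = np.array([m[(x, 1)] for x in X[te]])
    m0 = np.array([m[(x, 0)] for x in X[te]])
    eh = np.array([e[x] for x in X[te]])
    psi[te] = (m1 - m0
               + T[te] * (Y[te] - m1) / eh
               - (1 - T[te]) * (Y[te] - m0) / (1 - eh))
print(f"cross-fitted ATE estimate: {psi.mean():.3f}")
```

Each fold's predictions come only from models fit on the other fold, which is what licenses the use of flexible learners without overfitting bias.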
def dr_ate(X, T, Y, m_hat, e_hat):
    """Doubly robust (AIPW) ATE estimate.

    m_hat(X, t): out-of-fold outcome-model predictions for E[Y | X, T=t].
    e_hat(X):    out-of-fold propensity predictions for P(T=1 | X).
    """
    m1 = m_hat(X, t=1); m0 = m_hat(X, t=0); e = e_hat(X)
    # Regression contrast plus inverse-propensity-weighted residual corrections.
    contrib = (m1 - m0
               + T * (Y - m1) / e
               - (1 - T) * (Y - m0) / (1 - e))
    return contrib.mean()

6. Application
We applied the estimator to logs of 38,412 agent invocations from a production research-assistant deployment, where some invocations were routed to a configuration with an additional verifier sub-agent. Routing was determined by an internal heuristic that depended in part on a logged complexity score and in part on hidden user-side context.
Naive estimate. A simple difference-in-means yields an apparent verifier benefit of percentage points.
Adjusted estimate. With the doubly robust estimator on logged covariates, the estimate falls to points (95% CI ). Sensitivity analysis yields a robustness value (the critical treatment-assignment odds ratio): an unmeasured confounder would have to shift the treatment-assignment odds by at least that factor to nullify the effect. Domain experts judged odds ratios of that magnitude as plausible, so the verifier benefit is suggestive but not conclusive.
7. Limitations
Doubly robust estimation requires that at least one of the outcome model and the propensity model be correctly specified. With agent-log distributions that are messy and high-dimensional, both can be misspecified together. We mitigate by using cross-fitting and ensembled nuisance learners, but no statistical method substitutes for occasional randomized A/B testing.
A second limitation is that the ATE may not be the right estimand for many operational decisions. The CATE (conditional average treatment effect) on specific user segments would often be more useful; we sketch a CATE extension in the appendix.
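The appendix sketch is not reproduced here, but one simple route from the dr_ate contributions to segment-level CATEs is to average the per-unit scores within each segment (`cate_by_segment` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def cate_by_segment(psi, segment):
    """Group doubly robust per-unit scores (the `contrib` values inside
    dr_ate) by user segment; each group mean is a plug-in CATE estimate
    for that segment."""
    psi, segment = np.asarray(psi), np.asarray(segment)
    return {s: float(psi[segment == s].mean()) for s in np.unique(segment)}
```

For example, `cate_by_segment([0.1, 0.3, 0.2, 0.4], ["free", "free", "pro", "pro"])` returns segment means of 0.2 for "free" and 0.3 for "pro" (segment labels are illustrative).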
8. Conclusion
Observational agent logs can yield credible causal estimates when paired with appropriate methodology. Naive estimates can be substantially biased; doubly robust adjustment helps; sensitivity analysis reveals when the remaining uncertainty is large enough to warrant a randomized experiment.
References
- Rosenbaum, P. (2002). Observational Studies.
- Chernozhukov, V. et al. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters.
- Miao, W., Geng, Z., and Tchetgen Tchetgen, E. (2018). Identifying Causal Effects with Proxy Variables of an Unmeasured Confounder.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference.
- VanderWeele, T. and Ding, P. (2017). Sensitivity Analysis in Observational Research: Introducing the E-Value.