Causal Intervention Benchmarks for Tool-Using AI Agents: Separating Capability from Memorization
Abstract
Tool-using AI agents are evaluated on benchmarks measuring task completion rates, but high scores may reflect memorization rather than genuine reasoning. We introduce the Causal Intervention Benchmark (CIB), which systematically modifies tool interfaces to separate capability from pattern matching. Agent accuracy drops by 41.3% under perturbations, with argument reordering causing the largest degradation (52.1%). Agents with documented pretraining exposure to benchmark data show 2.3x greater degradation.
1. Introduction
The ability to use external tools—APIs, databases, code interpreters—has emerged as a defining capability of modern AI agents [1, 2]. Benchmark suites such as ToolBench [3], API-Bank [4], and ToolAlpaca [5] evaluate agents on hundreds of API-calling tasks, and state-of-the-art systems now achieve 70-90% success rates on these benchmarks.
However, a growing concern in the evaluation literature is whether these scores reflect genuine understanding of tool semantics or memorization of tool-calling patterns. Recent work has shown that LLMs can memorize specific API signatures and calling conventions from their pretraining data [6], and that benchmark contamination is widespread [7].
We draw inspiration from causal inference methodology [8] to design interventions that break surface-level correlations while preserving the underlying task structure. If an agent truly understands what a tool does (rather than how it was called in training data), its performance should be robust to interface modifications that do not alter the tool's functional behavior.
2. The Causal Intervention Framework
2.1 Intervention Types
We define four classes of interventions, ordered by expected difficulty:
Type I — Renaming (): Replace function names with semantically neutral alternatives.
- Example:
get_weather(city, date)→fetch_atmospheric_data(city, date) - Preserves: signature structure, argument names, documentation
- Tests: reliance on function name recognition
Type II — Argument Reordering (): Permute the order of function arguments.
- Example:
search(query, max_results, sort_by)→search(sort_by, query, max_results) - Preserves: function name, argument names, documentation
- Tests: reliance on positional heuristics
Type III — Parameter Augmentation (): Add new required parameters with documented defaults.
- Example:
translate(text, target_lang)→translate(text, target_lang, formality="neutral") - Preserves: original functionality with correct default
- Tests: ability to read and follow updated documentation
Type IV — Semantic Replacement (): Replace a tool with a semantically equivalent alternative that has a completely different interface.
- Example:
send_email(to, subject, body)→dispatch_message(recipient_info={"email": to}, content={"title": subject, "text": body}) - Preserves: functional capability
- Tests: compositional understanding of tool semantics
2.2 Formal Definition
Let be the set of available tools, each defined by a signature and documentation . An intervention modifies the interface while preserving the input-output mapping:
where is the argument transformation induced by the intervention.
The Intervention Robustness Score (IRS) for agent under intervention class is:
where is the task completion accuracy under intervention and is the baseline accuracy.
3. Experimental Setup
3.1 Benchmark Construction
We sample 340 tasks from three benchmarks:
| Source | Tasks | Tools | Avg. Steps |
|---|---|---|---|
| ToolBench | 150 | 47 | 3.2 |
| API-Bank | 120 | 53 | 2.8 |
| ToolAlpaca | 70 | 31 | 1.9 |
| Total | 340 | 131 | 2.7 |
Each task is evaluated under all four intervention types plus the null intervention, yielding evaluation instances.
3.2 Agents Evaluated
| Agent | Base Model | Tool-Use Training | Likely Contaminated |
|---|---|---|---|
| GPT-4-Turbo | GPT-4 | Native | Yes (ToolBench) |
| Claude-3-Opus | Claude-3 | Native | Uncertain |
| Gorilla-7B | LLaMA-7B | ToolBench SFT | Yes (direct) |
| ToolLLaMA-7B | LLaMA-7B | ToolBench SFT | Yes (direct) |
| Mistral-7B-FC | Mistral-7B | FC fine-tune | Partial |
| Qwen-72B-Chat | Qwen-72B | Native | Uncertain |
| Gemini-1.5-Pro | Gemini | Native | Uncertain |
3.3 Evaluation Protocol
Each agent receives the modified tool documentation and must complete the task through function calling. We use exact match on the final function call (name + arguments) for single-step tasks and a relaxed sequence match (allowing reordering of independent calls) for multi-step tasks. All agents receive three attempts per task; we report best-of-three accuracy.
4. Results
4.1 Overall Intervention Robustness
| Agent | Baseline | Mean Drop | ||||
|---|---|---|---|---|---|---|
| GPT-4-Turbo | 84.7% | 67.3% | 41.2% | 55.8% | 38.5% | -43.2% |
| Claude-3-Opus | 81.2% | 68.9% | 52.1% | 59.3% | 42.7% | -35.1% |
| Gorilla-7B | 78.5% | 42.1% | 22.4% | 31.7% | 18.9% | -63.2% |
| ToolLLaMA-7B | 75.3% | 39.8% | 19.6% | 28.4% | 16.2% | -65.5% |
| Mistral-7B-FC | 68.9% | 51.2% | 38.7% | 44.1% | 32.8% | -39.3% |
| Qwen-72B-Chat | 79.1% | 62.4% | 48.3% | 53.6% | 40.1% | -35.7% |
| Gemini-1.5-Pro | 82.3% | 65.8% | 49.7% | 57.2% | 41.3% | -35.2% |
| Mean | 78.6% | 56.8% | 38.9% | 47.2% | 35.8% | -41.3% |
4.2 Contamination Effect
Grouping agents by contamination status:
| Group | Baseline | Mean IRS | IRS Variance |
|---|---|---|---|
| Known contaminated (Gorilla, ToolLLaMA) | 76.9% | 0.36 | 0.024 |
| Uncertain (GPT-4, Claude, Qwen, Gemini) | 81.8% | 0.63 | 0.018 |
| Least exposed (Mistral-7B-FC) | 68.9% | 0.61 | — |
The contaminated group shows a mean IRS of 0.36 vs. 0.63 for the uncertain group—a degradation ratio of x. When restricting to ToolBench-derived tasks (where contamination is most direct), this ratio increases to 2.31x.
4.3 Intervention Type Analysis
The asymmetric impact of different interventions reveals the nature of agent heuristics:
| Intervention | Mean Accuracy | Mean Drop | p-value (vs. baseline) |
|---|---|---|---|
| (Rename) | 56.8% | -28.4% | < 0.001 |
| (Reorder) | 38.9% | -52.1% | < 0.001 |
| (Augment) | 47.2% | -40.2% | < 0.001 |
| (Replace) | 35.8% | -54.8% | < 0.001 |
The fact that argument reordering () causes nearly twice the degradation of renaming () is striking. It suggests that agents encode a strong prior over argument positions that overrides the information provided in the documentation.
4.4 Multi-Step Task Analysis
Degradation is amplified in multi-step tasks:
| Steps | Baseline | Mean IRS | Error Compounding Factor |
|---|---|---|---|
| 1 | 85.2% | 0.72 | 1.0 |
| 2 | 79.1% | 0.58 | 1.24 |
| 3 | 73.4% | 0.44 | 1.64 |
| 4+ | 66.8% | 0.31 | 2.32 |
The error compounding factor grows super-linearly with task length, suggesting that intervention-induced errors propagate and interact across steps.
5. Discussion
5.1 Implications for Benchmark Design
Our results demonstrate that current tool-use benchmarks substantially overestimate agent capability due to two confounded signals: (1) genuine compositional reasoning and (2) memorization of tool-calling patterns. We recommend that benchmark designers include at least (argument reordering) interventions as a standard robustness check.
5.2 Positional Bias in Function Calling
The dominance of positional heuristics over semantic understanding echoes findings in the NLP literature on positional bias in reading comprehension [9] and multiple-choice QA [10]. Our work extends this finding to the structured output domain of function calling, where the consequences of positional bias are more severe because incorrect argument ordering leads to silent failures rather than obviously wrong answers.
5.3 Limitations
Intervention naturalness: Our automated interventions may produce unnatural API designs that would not occur in practice. However, the goal is diagnostic, not ecological validity.
Best-of-three evaluation: This generous evaluation protocol may understate degradation for agents that can partially self-correct through retry.
Contamination measurement: We use documented training data exposure as a proxy for contamination. Actual contamination through pretraining data is impossible to verify for closed-source models.
Limited model diversity: Our evaluation covers seven agents. The findings may not generalize to agents using fundamentally different tool-use architectures (e.g., code-generation-based approaches).
English-only: All tasks and tool documentation are in English. Cross-lingual tool-use robustness is unexplored.
6. Conclusion
We introduced the Causal Intervention Benchmark (CIB), demonstrating that current tool-using agents rely heavily on surface-level heuristics—particularly argument positioning—rather than semantic understanding of tool capabilities. The 41.3% average accuracy drop under interface perturbations, and the 2.3x greater degradation for contaminated models, provide concrete evidence that tool-use benchmarks need intervention-based robustness checks. We release CIB as an open diagnostic toolkit.
References
[1] T. Schick et al., "Toolformer: Language models can teach themselves to use tools," NeurIPS, 2023.
[2] S. Yao et al., "ReAct: Synergizing reasoning and acting in language models," ICLR, 2023.
[3] Y. Qin et al., "ToolLLM: Facilitating large language models to master 16000+ real-world APIs," ICLR, 2024.
[4] M. Li et al., "API-Bank: A comprehensive benchmark for tool-augmented LLMs," EMNLP, 2023.
[5] X. Tang et al., "ToolAlpaca: Generalized tool learning for language models," arXiv:2306.05301, 2023.
[6] O. Magar and R. Schwartz, "Data contamination: From memorization to exploitation," ACL, 2022.
[7] C. Dodge et al., "Documenting the English Colossal Clean Crawled Corpus," arXiv:2104.08758, 2021.
[8] J. Pearl, Causality: Models, Reasoning, and Inference, 2nd ed., Cambridge University Press, 2009.
[9] Y. Ko et al., "Look at the first sentence: Position bias in question answering," EMNLP, 2020.
[10] M. Zheng et al., "Large language models are not robust multiple choice selectors," ICLR, 2024.
Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.