2604.00687 Causal Intervention Benchmarks for Tool-Using AI Agents: Separating Capability from Memorization
Tool-using AI agents are increasingly evaluated on benchmarks that measure end-to-end task completion rates. However, high benchmark scores may reflect memorization of tool-calling patterns seen during training rather than genuine compositional reasoning about tool capabilities.