Stochastic Tool Routing in Multi-Tool LLM Systems
1. Introduction
Production LLM agents now routinely have access to 20-100 tools — search APIs, code interpreters, database connectors, internal microservices. The dominant routing strategy is greedy argmax: the model emits a tool name and arguments in JSON, and the runtime invokes whichever tool has the matching name. While simple, this approach has well-documented failure modes when multiple tools have semantically overlapping descriptions, when descriptions are perturbed, or when novel tools are added [Patil et al. 2023; Schick et al. 2023].
We propose StochRoute, a stochastic tool-routing layer that sits between the LLM and the tool registry. StochRoute samples tools from a softmax over LLM-emitted scores, adds an exploration bonus inversely proportional to historical invocation count, and respects a per-tool cost budget.
2. Problem Formulation
Let $\mathcal{T} = \{t_1, \dots, t_N\}$ be the tool registry, with description embeddings $e_1, \dots, e_N$. At step $s$, the LLM emits a query embedding $q_s$. We define the routing distribution

$$\pi(t_i \mid q_s) = \frac{\exp\!\big(\langle e_i, q_s \rangle / \tau + \beta_s\, b_i\big)}{\sum_{j=1}^{N} \exp\!\big(\langle e_j, q_s \rangle / \tau + \beta_s\, b_j\big)},$$

where $\tau$ is the temperature, $b_i = \sqrt{\log s / (n_i + 1)}$ is the UCB-style exploration bonus, $n_i$ is the historical invocation count of tool $t_i$, and $\beta_s$ decays from $\beta_0$ to $0$ over the rollout horizon.
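As a concrete sketch of the routing distribution (a minimal NumPy version; the function name and exact bonus form are illustrative, not the paper's implementation):

```python
import numpy as np

def routing_distribution(query_emb, tool_embs, counts, step, tau=0.7, beta=0.5):
    """Softmax routing scores with a UCB-style exploration bonus.

    tool_embs: (N, d) description embeddings; counts: (N,) invocation counts.
    """
    sims = tool_embs @ query_emb                      # <e_i, q_s>
    bonus = np.sqrt(np.log(step + 1) / (counts + 1))  # assumed UCB-style form
    logits = sims / tau + beta * bonus
    logits -= logits.max()                            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Under this form, a never-invoked tool receives a larger bonus than an otherwise-identical heavily-used one, which is what drives the exploration behavior.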
3. Method
StochRoute proceeds in four steps:
- Score: compute for all tools.
- Filter: drop tools whose expected cost exceeds the remaining budget.
- Sample: draw with temperature .
- Verify: run a one-step "would-this-tool-help" critic [Yao et al. 2023] and reject-resample if the critic returns negative.
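The verify step can be any cheap yes/no judgment. As a stand-in for the one-step LLM critic (purely illustrative; the threshold check below is not the paper's critic), one might write:

```python
import numpy as np

def critic_approves(query_emb, tool, threshold=0.3):
    """Stand-in for the one-step LLM critic: approve the proposed tool
    only if its description embedding is close enough to the query."""
    q = np.asarray(query_emb, dtype=float)
    e = np.asarray(tool.embed, dtype=float)
    cos = q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-12)
    return cos >= threshold
```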
```python
def stoch_route(query_emb, tools, budget, tau=0.7, max_resamples=2):
    # Step 2 (filter): drop tools whose expected cost exceeds the budget.
    eligible = [t for t in tools if t.cost <= budget]
    # Step 1 (score): scaled similarity plus a weighted exploration bonus.
    scores = [t.embed @ query_emb / tau + 0.5 * exploration_bonus(t)
              for t in eligible]
    probs = softmax(scores)
    # Steps 3-4 (sample, verify), with reject-resampling.
    for _ in range(max_resamples + 1):
        choice = sample_categorical(probs)
        if critic_approves(query_emb, eligible[choice]):
            return eligible[choice]
    # All samples rejected: fall back to the highest-probability tool.
    return eligible[argmax(probs)]
```

4. Benchmark and Setup
We evaluate on MT47, a 47-tool benchmark we constructed by taking the union of the HuggingGPT and Gorilla tool sets and a 20-tool internal corpus, with 1,248 tasks split 70/15/15 into train/val/test. Base agents are Llama-3-70B-Instruct and Mistral-Large.
5. Results
| Method | Failure Rate | Avg. Tools/Task | Latency Overhead |
|---|---|---|---|
| Greedy argmax | 18.4% | 2.1 | 0% |
| Top-k=3 + critic | 14.9% | 2.4 | 9.1% |
| StochRoute (ours) | 11.7% | 2.3 | 4.2% |
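The headline relative improvement implied by the table can be checked directly (simple arithmetic, not part of the paper's code):

```python
greedy, stoch = 18.4, 11.7                      # failure rates (%) from the table
relative_reduction = (greedy - stoch) / greedy  # ≈ 0.364
assert round(relative_reduction * 100) == 36    # the "36%" figure in Section 8
```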
StochRoute's improvement is statistically significant (McNemar's test) and persists when we adversarially perturb 30% of tool descriptions via paraphrase: the failure rate degrades to 13.1%, versus 22.6% for greedy argmax.
5.1 Ablations
Dropping the exploration bonus erases 65% of StochRoute's improvement over greedy argmax; dropping the critic erases 22%. Temperature sweeps show a soft optimum near the default $\tau = 0.7$.
6. Discussion and Limitations
Stochasticity reintroduces non-determinism into agent runs, complicating debugging. We mitigate this by exposing a seed parameter and logging per-step routing probabilities. The exploration bonus assumes a long-running deployment with stable tool semantics; in cold-start regimes a description-similarity prior may work better. We also note that StochRoute is orthogonal to learned-router approaches like ToolBench-Distill and could be composed with them.
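Seeding plus per-step probability logging can be sketched as follows (a minimal version under assumed interfaces; the function and log-field names are hypothetical):

```python
import numpy as np

def seeded_sample(probs, step, seed=0, log=None):
    """Draw a tool index reproducibly and record the routing distribution."""
    rng = np.random.default_rng([seed, step])  # per-step reproducible stream
    choice = int(rng.choice(len(probs), p=probs))
    if log is not None:
        log.append({"step": step, "probs": list(probs), "choice": choice})
    return choice
```

Replaying a run with the same seed then reproduces every routing decision, while the log exposes the full distribution behind each draw for debugging.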
7. Threat Model for Adversarial Descriptions
A malicious tool author might craft a description that maximally overlaps with high-traffic tools (a "semantic squatting" attack). Greedy routers are vulnerable; StochRoute's critic step partially defends by re-evaluating the proposed tool against the actual query intent. We leave a formal analysis of attack/defense equilibria to future work.
8. Conclusion
Stochastic tool routing offers a simple, low-overhead improvement over greedy argmax in multi-tool LLM systems. The combination of softmax sampling, exploration bonuses, and a critic step yields a 36% reduction in task failure on our 47-tool benchmark. Code and benchmark will be released.
References
- Patil, S. et al. (2023). Gorilla: Large Language Model Connected with Massive APIs.
- Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools.
- Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.
- Qin, Y. et al. (2024). ToolLLM: Facilitating LLMs to Master 16000+ Real-world APIs.