Scaling Laws of Tool-Use Accuracy with Context Length
1. Introduction
Agentic systems routinely fill their context windows with tool schemas, prior tool outputs, and intermediate plans. Practitioners have long observed that as the context grows, the rate at which the model selects the wrong tool — or hallucinates an argument — also grows, but the functional form of this degradation has not been pinned down. We address that gap.
Our contributions are:
- A controlled benchmark, ToolStretch, that varies context length while holding task difficulty fixed.
- An empirical scaling law relating accuracy to context length.
- A practical context budget procedure for agent designers.
2. Background
Prior work on long-context evaluation (e.g. Needle-in-a-Haystack [Kamradt 2023], RULER [Hsieh et al. 2024]) measures retrieval accuracy but not action selection. Tool-use benchmarks such as ToolBench [Qin et al. 2024] hold context length roughly constant. We bridge these by generating tasks at six target context lengths between 4K and 128K tokens.
3. Method
3.1 Task generation
For each target length n, we generate a tuple (q, T, t*), where q is a user query, T is a set of tool schemas (drawn from a pool of 612), and t* ∈ T is the unique correct tool. Distractor schemas are sampled by lexical proximity to t*, controlled by a Jaccard threshold τ.
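The distractor-sampling step can be sketched as follows. The schema representation (token sets from tool names and descriptions), the helper names, and the threshold and pool values are illustrative assumptions, not the paper's exact configuration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def sample_distractors(correct_tokens, pool, tau, k):
    """Keep schemas whose token-set Jaccard similarity to the correct
    tool is at least tau, then return the k most similar names."""
    scored = [(jaccard(correct_tokens, toks), name) for name, toks in pool]
    close = sorted((s, n) for s, n in scored if s >= tau)
    return [n for _, n in reversed(close)][:k]

# Hypothetical schema pool: (name, token set) pairs.
pool = [
    ("get_weather", {"get", "weather", "city", "forecast"}),
    ("get_forecast", {"get", "forecast", "city", "daily"}),
    ("send_email", {"send", "email", "recipient", "body"}),
]
print(sample_distractors({"get", "weather", "city"}, pool, tau=0.3, k=2))
```

Selecting by similarity to t* rather than uniformly at random is what keeps task difficulty fixed as the context grows: the added schemas are plausible confusions, not obvious noise.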
3.2 Models
We evaluate four open-weight chat-tuned models in the 7B-70B parameter range. Decoding is greedy; tool calls are parsed via the standard function_call JSON schema.
3.3 Metric
Let c_i ∈ {0, 1} be the indicator of correct tool selection on sample i. We define accuracy at context length n as

Acc(n) = (1/N) · Σ_{i=1}^{N} c_i,

the fraction of the N samples at that length for which the correct tool is selected.
4. Scaling Law
We fit

Acc(n) = a − b · log₂(n)

by ordinary least squares to 24 (model, length) cells with 200 samples each. Across models, the fitted slope b ranges from (least sensitive) to (most sensitive) per doubling of n. The unweighted mean is , and the worst single-model fit is .
```python
import math

def predict_acc(n, a, b):
    # Fitted log-linear law: accuracy falls by b for each doubling of n.
    return a - b * math.log2(n)

# Example fitted parameters for model M3.
a, b = 0.91, 0.024
for n in [4096, 32768, 131072]:
    print(n, round(predict_acc(n, a, b), 3))
```

A naive exponential model fit substantially worse, suggesting the effect is not multiplicative attrition per token but rather logarithmic in length.
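The fit itself reduces to one-variable least squares on x = log₂ n. A minimal self-contained sketch, on synthetic accuracies rather than the paper's measurements:

```python
import math
import random

def fit_log_linear(ns, accs):
    """OLS fit of acc = a - b * log2(n): regress acc on x = log2(n)."""
    xs = [math.log2(n) for n in ns]
    mx = sum(xs) / len(xs)
    my = sum(accs) / len(accs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, accs))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var  # equals -b in the law above
    return my - slope * mx, -slope

# Synthetic data generated from a = 0.91, b = 0.024 with small noise.
random.seed(0)
ns = [4096, 8192, 16384, 32768, 65536, 131072]
accs = [0.91 - 0.024 * math.log2(n) + random.uniform(-0.005, 0.005) for n in ns]
a_hat, b_hat = fit_log_linear(ns, accs)
print(round(a_hat, 3), round(b_hat, 3))
```

With only six length levels per model, the slope estimate leans heavily on the extremes, which is one reason the benchmark spans a full 32× range of n.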
5. Results
- At n = 4K, mean accuracy across models is 0.873.
- At n = 128K, mean accuracy drops to 0.741.
- The drop is not explained by retrieval failure alone: even when t*'s schema appears in the first 1024 tokens, accuracy at n = 128K remains 6.2 percentage points below the n = 4K baseline (paired bootstrap).
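The paired-bootstrap comparison behind that last point can be sketched as below. The per-task correctness indicators, resample count, and function name are illustrative assumptions:

```python
import random

def paired_bootstrap_p(base, long, iters=5000, seed=0):
    """One-sided paired bootstrap: resample task indices with replacement
    and count how often the baseline's accuracy advantage disappears.
    base/long are 0/1 correctness indicators over the same tasks."""
    rng = random.Random(seed)
    n = len(base)
    hits = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(base[i] - long[i] for i in idx) / n
        if diff <= 0:
            hits += 1
    return hits / iters

# Synthetic indicators with the baseline clearly stronger.
rng = random.Random(1)
base = [1 if rng.random() < 0.90 else 0 for _ in range(200)]
long_ = [1 if rng.random() < 0.65 else 0 for _ in range(200)]
print(paired_bootstrap_p(base, long_))
```

Resampling tasks (rather than the two conditions independently) preserves the pairing, so per-task difficulty cancels out of the accuracy difference.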
5.1 Context budget
Given a target accuracy floor Acc_min (e.g. 0.80), the law inverts to

n_budget = 2^((a − Acc_min) / b).

For model M2, plugging its fitted (a, b) into this formula gives the token budget at Acc_min = 0.80.
6. Discussion and Limitations
Our scaling law is descriptive, not causal. We do not disentangle whether the degradation comes from positional encoding effects, attention dilution, or distractor crowding. ToolStretch uses synthetic schemas; real-world tools are more heterogeneous. We also restrict to single-turn tool calls — multi-turn errors may compound super-logarithmically.
7. Conclusion
A simple log-linear law predicts tool-use accuracy across four models and 32x context-length variation. The fitted slopes give designers a principled way to set context budgets rather than relying on folk wisdom.
References
- Kamradt, G. (2023). Needle in a Haystack: Pressure Testing LLMs.
- Hsieh, C.-P. et al. (2024). RULER: What's the Real Context Size of Your Long-Context Models?
- Qin, Y. et al. (2024). ToolLLM: Facilitating Large Language Models to Master Tools.
- Liu, N. F. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts.