Dynamic Context-Window Allocation Across Sub-Agents in Hierarchical LLM Systems
1. Motivation
Long-context LLMs are not infinitely long. Even a 1M-token model is expensive at scale, and inference latency grows with prompt length [Chen & Yao 2024]. When several sub-agents share a budget — for example, a planner, a code-writer, and a reviewer all running off one orchestrator — the question of who gets how many tokens becomes a first-class system design problem.
Most frameworks today either (a) give each sub-agent a fixed slice (e.g., 8K planner / 16K writer / 8K reviewer), or (b) truncate naively when the limit is reached. Neither adapts to the fact that, on any given task, one sub-agent may need much more context than the others.
2. Problem Formulation
Let $A_1, \dots, A_n$ be sub-agents. At time step $t$, agent $i$ requests context of size $r_i(t)$ tokens. The orchestrator allocates $c_i(t) \le r_i(t)$ subject to $\sum_i c_i(t) \le B$, the global budget. Let $v_i$ be the value-of-information function for agent $i$: a non-decreasing, concave function mapping context size to expected contribution to task success.
The one-shot problem is

$$\max_{c_1, \dots, c_n} \; \sum_i v_i(c_i) \quad \text{s.t.} \quad \sum_i c_i \le B, \;\; 0 \le c_i \le r_i.$$

Under concavity this is solved greedily by water-filling on marginal utilities. The challenge is that the $v_i$ are unknown and must be estimated online.
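Concavity gives a simple optimality certificate (standard KKT conditions for this problem class): there is a water level $\lambda \ge 0$ such that

$$v_i'(c_i^*) = \lambda \;\;\text{if } 0 < c_i^* < r_i, \qquad c_i^* = 0 \;\;\text{if } v_i'(0) \le \lambda, \qquad c_i^* = r_i \;\;\text{if } v_i'(r_i) \ge \lambda.$$

Agents whose marginal utility never reaches the water level receive nothing; agents whose marginal utility still exceeds it at their full request are saturated at $r_i$.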
3. Method: AdaCtx
AdaCtx maintains, for each agent and each of $K$ context-size buckets, a sliding-window estimate of marginal contribution to a downstream success signal. The success signal is the most recent post-task judgment by an LLM-as-judge, propagated back to the agents via a Shapley-style attribution scheme [Lundberg & Lee 2017].
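The attribution step is not spelled out above; one common way to realize Shapley-style credit assignment is Monte Carlo sampling over agent orderings. The sketch below is a hypothetical instantiation, assuming a `judge` callable that scores any subset of agent contributions (the function and callable names are illustrative, not from the paper):

```python
import random

def shapley_credit(agents, judge, n_samples=200, seed=0):
    """Approximate Shapley values: average each agent's marginal
    contribution to the judge score over random agent orderings."""
    rng = random.Random(seed)
    credit = {a: 0.0 for a in agents}
    for _ in range(n_samples):
        order = agents[:]
        rng.shuffle(order)
        included = []
        prev = judge([])  # score of the empty coalition
        for a in order:
            included.append(a)
            score = judge(list(included))
            credit[a] += score - prev
            prev = score
    return {a: s / n_samples for a, s in credit.items()}
```

For an additive judge the estimate is exact; in practice the judge is an LLM call, so a small `n_samples` with caching of repeated coalitions keeps the cost manageable.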
At each scheduling tick, AdaCtx solves a discretized version of the water-filling problem:
```python
def allocate(requests, budget, marginal_utility, bucket):
    """Greedy water-filling over discrete context buckets."""
    alloc = {a: 0 for a in requests}
    remaining = budget
    while remaining > 0:
        # Consider only agents whose requests are not yet fully satisfied,
        # so a saturated top agent cannot stall allocation to the others.
        open_agents = [a for a in requests if alloc[a] < requests[a]]
        if not open_agents:
            break
        best = max(open_agents, key=lambda a: marginal_utility(a, alloc[a]))
        step = min(bucket, requests[best] - alloc[best], remaining)
        alloc[best] += step
        remaining -= step
    return alloc
```

The estimator uses an exponential moving average with a half-life of 50 tasks. We add an $\epsilon$-greedy exploration term to avoid premature collapse onto a single agent.
4. Experimental Setup
We evaluate on three task families: research synthesis (314 tasks, 4 agents), code repair on real GitHub issues (493 tasks, 3 agents), and operations triage on synthetic incidents (300 tasks, 5 agents). For each family we compare AdaCtx against (a) uniform allocation, (b) static role-based allocation hand-tuned per family, (c) greedy first-come-first-served, and (d) an unconstrained oracle that gives each agent its full request.
Total budget is set so that uniform allocation forces non-trivial truncation: a fixed fraction of the unconstrained oracle's mean usage.
5. Results
| Method | Synth | Repair | Triage | Mean |
|---|---|---|---|---|
| Uniform | 58.4% | 41.2% | 63.8% | 54.5% |
| Static-tuned | 64.1% | 47.0% | 67.2% | 59.4% |
| FCFS | 55.1% | 39.4% | 60.5% | 51.7% |
| AdaCtx (ours) | 70.8% | 53.3% | 77.7% | 67.3% |
| Oracle | 73.2% | 55.1% | 80.6% | 69.6% |
AdaCtx narrows the gap to the unconstrained oracle from 15.1 points (uniform) to 2.3 points, while using the same constrained budget. Token count at matched success rate is 31% lower than uniform.
On code repair, AdaCtx learned to allocate the bulk of the budget to the code-reading agent during the early diagnosis phase, then shift toward the patch-writing agent as it produced candidates — mirroring the pattern an experienced human engineer would follow.
6. When AdaCtx Hurts
We observed degradation on a held-out cluster of adversarial tasks where one agent's marginal utility is non-stationary in a way that defeats EMA estimation. Specifically, tasks that begin with a misleading prompt (e.g., a plausible but wrong stack trace) caused AdaCtx to over-invest in an agent whose early signals were strong but ultimately wrong. We document this failure mode and propose an outlier-robust variant in Appendix A.
AdaCtx also adds a small overhead: per-tick allocation runs in milliseconds on a single CPU core. At very high tick rates, this overhead becomes significant and amortization across batches is needed.
7. Discussion
The core intuition behind AdaCtx — concave utility plus marginal-water-filling — has been deployed in network bandwidth allocation for decades [Kelly 1997]. Its application to LLM context budgeting introduces two new challenges: (1) utility functions vary across tasks within a single deployment, and (2) feedback signals are noisy and delayed. EMA with a tuned half-life addresses the second; explicit task-family conditioning would address the first and is left for future work.
A limitation of our evaluation is that we use a single underlying model (Llama-3-70B-Instruct) for all sub-agents. Heterogeneous deployments — with different agents on different model sizes — would change the utility curves in ways we did not measure.
8. Conclusion
Dynamic context allocation, treated as an online resource-allocation problem with concave utilities, substantially closes the gap between constrained and unconstrained operation in hierarchical LLM systems. AdaCtx is a small, drop-in controller for any orchestrator that exposes per-agent context requests.
References
- Kelly, F. P. (1997). Charging and Rate Control for Elastic Traffic.
- Lundberg, S. and Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions.
- Chen, X. and Yao, M. (2024). Latency Profiles of Long-Context Inference.
- Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.
- Zhang, B. et al. (2025). Hierarchical Agent Architectures: A Critical Review.