Cache-Aware Prompt Decomposition for Long-Context Reasoning
1. Introduction
Reasoning agents routinely build prompts of 30k-200k tokens by concatenating a system message, a memory snapshot, retrieved documents, a tool spec, and a working trace. Naive concatenation orders these so that small changes (e.g., a new tool observation) invalidate the cache for the entire tail. Recent serving engines [Kwon et al. 2023, Pope et al. 2023] support prefix-keyed KV reuse, but only if upstream callers cooperate.
We formalize the cooperation problem and present a compiler-style pass, CAPD, that analyzes an agent's prompt template and emits a cache-friendly variant.
2. Background
Let a prompt be a sequence of segments $P = (s_1, s_2, \ldots, s_n)$. The serving engine caches a prefix $s_1 \cdots s_k$ if (and only if) it has previously processed the exact same byte sequence. Define the cache boundary $b(P)$ as the largest index $k$ such that the prefix $s_1 \cdots s_k$ is reused on at least 50% of subsequent calls in a session.
The quantity we wish to maximize is the expected reused-prefix fraction

$$\frac{1}{|\mathcal{P}|} \sum_{P \in \mathcal{P}} \frac{|s_1 \cdots s_{b(P)}|}{|P|},$$

where $\mathcal{P}$ is the set of prompts produced over an agent run.
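To make the objective concrete, here is a minimal sketch (our illustration, not released code) that approximates the reused-prefix fraction by comparing each prompt to the immediately preceding one at the byte level; the true cache boundary depends on the engine's cache state and the 50%-reuse criterion above, which this simplification ignores:

from typing import List

def shared_prefix_len(a: bytes, b: bytes) -> int:
    # Length in bytes of the longest common prefix of a and b.
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def reused_fraction(prompts: List[bytes]) -> float:
    # Average fraction of each prompt's bytes covered by the prefix
    # it shares with the previous prompt in the session.
    if len(prompts) < 2:
        return 0.0
    fractions = [
        shared_prefix_len(prev, cur) / len(cur)
        for prev, cur in zip(prompts, prompts[1:])
    ]
    return sum(fractions) / len(fractions)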
3. Method
CAPD performs three passes over an agent's prompt construction code:
- Stability analysis. Each segment is annotated as stable, monotonic, or volatile based on its provenance (e.g., literal strings vs. function arguments derived from runtime state).
- Reordering. Segments are topologically reordered so that all stable segments precede monotonic ones, which precede volatile ones, subject to the original semantic constraints encoded as pin directives.
- Boundary insertion. A small marker is inserted at the cache boundary; the runtime uses this marker to issue an explicit prefix-prefill request.
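To make the reordering pass concrete, here is a minimal sketch under stated assumptions: Segment, STABILITY_RANK, and reorder are our illustrative names, and pins are modeled as a simple pinned_after field rather than the full directive language. A stable sort keeps the original order within each stability class:

from dataclasses import dataclass
from typing import List, Optional

STABILITY_RANK = {"stable": 0, "monotonic": 1, "volatile": 2}

@dataclass
class Segment:
    name: str
    stability: str                      # "stable" | "monotonic" | "volatile"
    text: str
    pinned_after: Optional[str] = None  # hard ordering constraint, if any

def reorder(segments: List[Segment]) -> List[Segment]:
    # Stable sort: preserves the original order within each class.
    out = sorted(segments, key=lambda s: STABILITY_RANK[s.stability])
    # Repair pin directives by moving each pinned segment directly
    # after its anchor.
    for seg in [s for s in out if s.pinned_after is not None]:
        out.remove(seg)
        anchor = next(i for i, s in enumerate(out) if s.name == seg.pinned_after)
        out.insert(anchor + 1, seg)
    return out

The stability annotations themselves are expressed through a small decorator API: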
@capd.pin(after="system")  # pin directive: constrain ordering relative to the system segment
def build_prompt(memory, tools, history, query):
    return [
        capd.stable(SYSTEM),             # literal string: never changes
        capd.stable(tools.spec_text()),  # tool spec: fixed for the session
        capd.monotonic(memory.summary),  # grows append-only across turns
        capd.volatile(history.tail(8)),  # changes on every call
        capd.volatile(query),            # changes on every call
    ]
4. Experimental Setup
We evaluated CAPD on four long-context tasks:
- LegalQA-200K
- CodeRepoNav
- MultiHopWiki
- AgentBench-Long
Serving was performed on a vLLM-equivalent backend with 80 GB devices and a 64k context window. We measured time-to-first-token (TTFT), total billed tokens, and task-specific accuracy.
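For readers reproducing the setup on stock vLLM (our backend was vLLM-equivalent, so treat this as an approximation rather than our exact configuration), automatic prefix caching is enabled with a single engine flag; the model name is illustrative:

from vllm import LLM, SamplingParams

# Automatic prefix caching lets byte-identical prompt prefixes reuse
# cached KV blocks across calls.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative choice
    enable_prefix_caching=True,
    max_model_len=65536,                       # 64k context window
)

outputs = llm.generate(
    ["<stable system + tools><monotonic memory><volatile query>"],
    SamplingParams(max_tokens=128),
)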
5. Results
5.1 Latency
Median TTFT dropped from 1.84 s to 1.13 s, a 38.7% reduction. The 95th percentile dropped from 4.12 s to 2.58 s.
5.2 Cost
Billed prefill tokens dropped 22.4% on average. The reduction was largest on CodeRepoNav (31.1%), where retrieved repository chunks are stable across many queries.
5.3 Quality
A paired bootstrap on per-task accuracy showed no significant difference between CAPD and the baseline.
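For concreteness, a minimal sketch of the paired-bootstrap procedure described above (the resampling count and seed are our assumptions):

import numpy as np

def paired_bootstrap(acc_a, acc_b, n_boot=10_000, seed=0):
    # Resample per-query accuracy differences with replacement and
    # return the fraction of resamples in which system A beats B.
    # Values near 0.5 indicate no detectable difference.
    rng = np.random.default_rng(seed)
    diffs = np.asarray(acc_a, float) - np.asarray(acc_b, float)
    n = len(diffs)
    means = np.array([
        diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)
    ])
    return float((means > 0).mean())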
5.4 Failure mode
On MultiHopWiki, decomposition hurt 6.3% of queries because the model's reasoning is sensitive to the order of retrieved passages. We propose a decision rule: if segment ordering is part of the inductive bias the model relies on, mark the whole retrieval block as a single pin group (see the sketch below).
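Applied to the builder from Section 3, the rule reads as follows; capd.group is a hypothetical spelling for a pin-group helper (a minimal sketch, since Section 3 shows only per-segment annotations):

def build_prompt_pinned(memory, passages, query):
    return [
        capd.stable(SYSTEM),
        capd.monotonic(memory.summary),
        # Hypothetical pin group: the retrieved passages are treated
        # as one unit, so CAPD may move the block as a whole but never
        # reorders the passages relative to each other.
        capd.group(*(capd.volatile(p.text) for p in passages)),
        capd.volatile(query),
    ]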
6. Discussion
CAPD's gains depend on session repetition. On one-shot workloads (e.g., a research assistant answering a single bespoke question), our gains shrink to under 5%. The pass is most useful for long-running agents with persistent memory and tool specs.
7. Limitations
We do not address tokenization-boundary effects across BPE merges that may invalidate the cache despite byte-equal prefixes; in practice this affected a nonzero fraction of calls in our experiments.
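The effect is easy to reproduce with any BPE tokenizer; here is a minimal illustration using the open-source tiktoken library (chosen for the example, not the tokenizer of our serving stack):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prefix = "The quick brown fo"   # byte-equal prefix of `full`
full = "The quick brown fox"

tp = enc.encode(prefix)
tf = enc.encode(full)

# BPE may merge the trailing bytes of the prefix with what follows,
# so the token sequence of a byte-equal prefix is not necessarily a
# prefix of the token sequence of the full string. A token-keyed
# cache then misses even though the bytes match.
print(tf[: len(tp)] == tp)      # typically False for this pair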
8. Conclusion
A simple stability-typed reordering of prompt segments yields substantial latency and cost gains on long-context reasoning workloads, with no measurable quality cost in three of four benchmarks. CAPD is a drop-in pass and integrates cleanly with existing prompt-template libraries.
References
- Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention.
- Pope, R. et al. (2023). Efficiently Scaling Transformer Inference.
- Liu, Y. and Chen, Z. (2025). Prefix Caching for Multi-Turn Agents.
- Anthropic (2024). Prompt Caching API Notes.