Memory Consolidation Strategies for Long-Running AI Agents
1. Introduction
An agent that operates for weeks or months produces a torrent of interaction logs: tool outputs, user messages, intermediate plans, error traces. Storing every token is feasible; retrieving the right thing later is not. We frame this as a memory consolidation problem: which subset of past tokens should be retained as a privileged, fast-access memory atom, and which should be archived behind cold-storage retrieval?
This paper compares four consolidation policies along two axes: information retention and compute cost.
2. Problem Setting
Let an agent observe a sequence of episodes $e_1, e_2, \ldots, e_T$. After each round $t$, a consolidator produces a memory atom set $M_t$ subject to a budget $|M_t| \le B$. At test time, a retrieval query $q$ returns the top-$k$ atoms from $M_t$; we measure fact recall: the fraction of ground-truth facts that the agent can recover via these retrievals.
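For concreteness, a minimal sketch of the recall metric; `retrieve` and `facts_supported_by` (which maps an atom to the ground-truth facts it entails) are assumed helpers for illustration, not part of our implementation.

```python
def fact_recall(ground_truth_facts, retrieve, queries, k=10):
    # Fraction of ground-truth facts recoverable via top-k retrieval.
    recovered = set()
    for q in queries:
        for atom in retrieve(q, k):  # top-k atoms returned for this query
            # facts_supported_by: assumed helper returning the set of facts an atom entails
            recovered |= facts_supported_by(atom) & ground_truth_facts
    return len(recovered) / len(ground_truth_facts)
```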
3. Consolidation Policies
3.1 Recency-only
Keep the $B$ most recent atoms. Simple but pathological for long horizons: anything older than the budget window is evicted regardless of importance.
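A minimal sketch, assuming atoms are stored in chronological order:

```python
def consolidate_recency(atoms, budget):
    # Keep only the `budget` most recently created atoms.
    return atoms[-budget:]
```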
3.2 Frequency-only
Keep the $B$ atoms with the highest historical retrieval count. Strong on stable facts, weak on rare-but-important events.
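A minimal sketch, assuming each atom carries a `retrieval_count` maintained by the retrieval layer:

```python
def consolidate_frequency(atoms, budget):
    # Keep the atoms retrieved most often so far.
    return sorted(atoms, key=lambda a: a.retrieval_count, reverse=True)[:budget]
```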
3.3 Surprise-weighted
Define surprise as $s(a) = -\log p_\theta(a)$, the negative log-probability of atom $a$ under the agent's own model. Atoms with high surprise are likely to be informative novel facts. Score:

$\mathrm{score}(a) = \lambda\, s(a) + (1 - \lambda)\,\mathrm{recency}(a)$

with $\lambda$ chosen on a 7-day pilot.
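A sketch of the scorer; `logprob_under_agent_model` and `recency` stand in for calls into the agent's own model and its recency bookkeeping, and the default $\lambda$ is illustrative:

```python
def surprise_score(atom, lam=0.7):
    # Surprise: negative log-probability of the atom under the agent's own model.
    s = -logprob_under_agent_model(atom.text)
    # Blend surprise with a recency term; lam would be tuned on a pilot run.
    return lam * s + (1 - lam) * recency(atom)

def consolidate_surprise(atoms, budget, lam=0.7):
    # Keep the `budget` highest-scoring atoms.
    return sorted(atoms, key=lambda a: surprise_score(a, lam), reverse=True)[:budget]
```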
3.4 Sleep-replay
During idle periods, the agent replays sampled episodes through a summarizer LLM and merges semantically overlapping atoms. Loosely inspired by hippocampal replay [Wilson and McNaughton 1994] and recent neural-replay work in agent literature [Park et al. 2024].
```python
def sleep_replay(atoms, summarizer, k=8):
    # Group semantically overlapping atoms into k clusters
    # (cluster_by_embedding: e.g., k-means over atom embeddings).
    clusters = cluster_by_embedding(atoms, k)
    # Merge each cluster into a single consolidated atom via the summarizer LLM.
    return [summarizer.merge(c) for c in clusters]
```

4. Experimental Setup
We simulated 27 agents over a 90-day horizon. Each agent received a daily mix of (a) recurring routine queries, (b) one-shot novel queries, and (c) distractor traffic. Ground truth was a curated set of 1,144 facts the agent should still know on day 90. The memory budget fixed both the atom count $B$ and the per-atom token length, held constant across all policies.
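The daily traffic could be generated along the following lines; the 60/10/30 proportions here are illustrative assumptions, not the rates used in our runs:

```python
import random

def daily_queries(routine_pool, novel_pool, distractor_pool, n=100, seed=0):
    # Illustrative mix: mostly recurring routine queries, a few one-shot
    # novel queries (consumed from a pool), and a block of distractor traffic.
    rng = random.Random(seed)
    batch = [rng.choice(routine_pool) for _ in range(int(0.6 * n))]
    batch += [novel_pool.pop() for _ in range(min(int(0.1 * n), len(novel_pool)))]
    batch += [rng.choice(distractor_pool) for _ in range(int(0.3 * n))]
    rng.shuffle(batch)
    return batch
```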
5. Results
5.1 Retention vs. compression
| Policy | Compression ratio | Day-90 fact recall | Replay cost (relative) |
|---|---|---|---|
| Recency-only | 12x | 73.8% | 1.0x |
| Frequency-only | 12x | 80.5% | 1.05x |
| Surprise-weighted | 12x | 91.4% | 1.2x |
| Sleep-replay | 12x | 93.2% | 4.1x |
Differences between recency-only and surprise-weighted were statistically significant under a paired Wilcoxon signed-rank test across the 27 simulated agents.
5.2 Sensitivity to budget
Under a tighter budget, surprise-weighted dropped to 84.1% and sleep-replay to 86.7%, while recency-only collapsed to 49.0%. The gap widens as the budget tightens, exactly when a policy is most needed.
6. Discussion
Surprise-weighted consolidation is the practical sweet spot. Sleep-replay's marginal +1.8-point recall gain is rarely worth the 4.1x compute hit unless agent traffic is light enough to amortize idle replay cycles.
A notable failure mode of surprise-weighting: in adversarial settings, an attacker can inject high-surprise garbage to evict legitimate atoms. We recommend coupling surprise-weighting with a per-source rate limit.
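One possible guard, as a sketch; the `score` and `source` fields on atoms and the 20% cap are assumptions for illustration:

```python
from collections import Counter

def admit_with_rate_limit(candidates, budget, per_source_cap=0.2):
    # Admit highest-scoring atoms first, but cap any single source at a
    # fixed fraction of the budget so high-surprise spam cannot flood memory.
    cap = max(1, int(per_source_cap * budget))
    counts, admitted = Counter(), []
    for atom in sorted(candidates, key=lambda a: a.score, reverse=True):
        if len(admitted) == budget:
            break
        if counts[atom.source] < cap:
            counts[atom.source] += 1
            admitted.append(atom)
    return admitted
```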
7. Limitations
Our 90-day horizon is short by deployed-agent standards; effects at year scale may diverge. Our fact set is also drawn from a narrow domain (a synthetic personal-assistant scenario).
8. Conclusion
For long-running agents, the choice of memory-consolidation policy can swing fact recall by nearly 20 percentage points at fixed budget. We recommend surprise-weighted consolidation as a default and sleep-replay only when idle compute is genuinely available.
References
- Wilson, M. A., & McNaughton, B. L. (1994). Reactivation of hippocampal ensemble memories during sleep. Science, 265(5172), 676–679.
- Park, S., et al. (2024). Generative Agents and Memory Streams.
- Borgeaud, S., et al. (2022). Improving Language Models by Retrieving from Trillions of Tokens. In ICML.
- Khattab, O., et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.