Agents That Remember: Scaling Lifelong Learning in Language Models

Author: Denis Avetisyan


New research introduces a comprehensive benchmark for evaluating how effectively AI agents can learn from ongoing experience and refine their own memories.

The ReMem agent operates through iterative search, synthesis, and evolution of learned memory across tasks. A three-module architecture (reasoning and decomposition; memory refinement through retrieval, pruning, and organization; and environmental execution) lets it dynamically adapt and perform.
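The three-module loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `reason`, `refine`, and `execute` are hypothetical stand-ins for the LLM-backed reasoning, memory-refinement, and environment modules.

```python
def remem_step(task, memory, reason, refine, execute):
    """One iteration of a ReMem-style loop (sketch, not the actual system):
    reason decomposes the task, refine retrieves/prunes/organizes memory,
    execute acts in the environment and returns new experience."""
    subgoals = reason(task, memory)
    memory = refine(memory, task)        # retrieval, pruning, organization
    result, experience = execute(subgoals, memory)
    memory.append(experience)            # memory evolves across tasks
    return result, memory


# Toy stand-ins so the loop is runnable end to end.
def reason(task, memory):
    return [f"do: {task}"]

def refine(memory, task):
    # Naive pruning: keep only entries that mention the task's first word.
    return [m for m in memory if task.split()[0] in m]

def execute(subgoals, memory):
    return "ok", f"completed {subgoals[0]}"


result, memory = remem_step("boil water",
                            ["boil kettle first", "unrelated note"],
                            reason, refine, execute)
```

After one step the irrelevant note has been pruned and the new experience appended, which is the "evolution" half of the loop.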

This work presents Evo-Memory, a framework for assessing and improving test-time learning and self-evolving memory in large language model agents, demonstrating substantial performance gains across diverse tasks.

While large language models excel at many tasks, their ability to continuously learn and retain experience remains a critical limitation for real-world agentic applications. To address this, we introduce Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory, a comprehensive framework and benchmark designed to evaluate and advance self-evolving memory in LLM agents. Our results demonstrate that continual refinement of memory (through mechanisms for retrieving, integrating, and updating past interactions) significantly enhances performance across diverse single-turn and multi-turn tasks. Will this focus on test-time learning unlock the potential for truly lifelong intelligence in LLM-powered agents?


The Limits of Recall: Towards Genuine Reasoning

Despite the remarkable proficiency of Large Language Models in generating human-quality text and performing various linguistic tasks, these systems frequently encounter difficulties when faced with problems requiring extended, sequential thought processes or novel circumstances. The architecture of many current models prioritizes pattern recognition within vast datasets, which, while effective for common scenarios, proves inadequate when confronted with situations demanding flexible problem-solving or extrapolation beyond learned data. Essentially, these models excel at recalling information but struggle with genuine reasoning – a critical distinction that limits their applicability in dynamic, real-world contexts. This limitation manifests as errors in multi-step calculations, difficulties in adapting to unforeseen changes in input, and a general inability to generalize knowledge effectively to entirely new situations, highlighting the need for more robust and adaptable artificial intelligence systems.

Existing artificial intelligence systems, while proficient in specific tasks, frequently falter when confronted with scenarios demanding flexible problem-solving and adaptation. This limitation stems from a fundamental deficiency in their ability to dynamically retain and utilize past experiences – a true ‘dynamic memory’ – coupled with a lack of ‘reflective’ capabilities that allow them to analyze their own performance and adjust strategies accordingly. Consequently, these systems struggle in open-ended environments – those characterized by unpredictable inputs and evolving demands – where consistent performance requires more than simply recalling pre-programmed responses. Without the capacity for self-assessment and iterative improvement, current approaches remain constrained, hindering the development of genuinely intelligent agents capable of navigating complex, real-world challenges.

The future of artificial intelligence likely hinges on a transition from static models to self-evolving agents – systems engineered not just to respond to inputs, but to critically examine their own performance and iteratively refine their internal processes. These agents move beyond simply processing data; they actively learn how to learn, employing continuous self-reflection to identify weaknesses and enhance strengths. This dynamic approach, inspired by biological evolution and metacognition, promises to overcome the limitations of current large language models, which struggle with adaptability and complex reasoning. By embedding mechanisms for self-assessment and improvement, these agents can navigate novel situations, optimize strategies, and ultimately achieve a level of intelligence that transcends pre-programmed responses, opening doors to genuinely autonomous and robust AI systems.

A stateful agent effectively learns from and reuses past experiences to handle both multi-turn tasks like embodied manipulation and single-turn tasks such as equation solving.

Memory as More Than Storage: The Architecture of Adaptation

Agent performance is fundamentally linked to the capacity of its memory systems; however, basic storage and retrieval mechanisms are insufficient for complex tasks. Effective agent memory requires more than simply recalling stored data; it necessitates the ability to prioritize information, manage relevance, and adapt to changing contexts. While traditional databases offer storage, they lack the dynamic reasoning and contextual understanding required for agents operating in complex environments. Consequently, advanced architectures focus not only on what is remembered, but on how it is accessed, updated, and utilized to inform decision-making processes, moving beyond simple recall to active memory management and contextual reasoning.

Workflow Memory systems enhance agent performance by storing and reusing successful task strategies. Systems like Dynamic Cheatsheets facilitate the creation of easily accessible, knowledge-rich resources that agents can consult during operation. Agent Workflow Memory specifically focuses on capturing and re-applying sequences of actions that have previously led to successful outcomes. This approach moves beyond simple data storage by providing agents with pre-defined plans, reducing the need for real-time problem-solving and increasing operational efficiency. By leveraging previously learned workflows, agents can consistently execute complex tasks with improved speed and reliability, effectively automating aspects of their decision-making process.
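The idea of capturing and re-applying successful action sequences can be illustrated with a small sketch. The class, its names, and the token-overlap similarity below are assumptions for illustration only; real systems such as Agent Workflow Memory use learned representations rather than word overlap.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowMemory:
    """Hypothetical store mapping task descriptions to action sequences
    that previously led to success (a sketch, not a real system's API)."""
    workflows: dict = field(default_factory=dict)

    def record_success(self, task: str, actions: list) -> None:
        # Only workflows that actually completed their task are stored.
        self.workflows[task] = actions

    def retrieve(self, task: str):
        # Naive token overlap stands in for a learned similarity measure.
        def overlap(a, b):
            return len(set(a.lower().split()) & set(b.lower().split()))
        if not self.workflows:
            return None
        best = max(self.workflows, key=lambda t: overlap(t, task))
        return self.workflows[best] if overlap(best, task) > 0 else None


memory = WorkflowMemory()
memory.record_success("heat the mug in the microwave",
                      ["take mug", "open microwave", "put mug", "start microwave"])
plan = memory.retrieve("heat a cup in the microwave")
```

A new but related task ("heat a cup") retrieves the stored plan, letting the agent skip real-time planning for a problem it has effectively already solved.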

LangMem and MemoryOS represent advancements in agent memory architectures by applying concepts from operating systems and large language models. MemoryOS structures agent memory as a virtualized operating system, employing concepts like address spaces, permissions, and file systems to organize and control access to information. LangMem utilizes language modeling techniques to encode memories as embeddings, allowing for semantic similarity searches and contextual retrieval beyond keyword matching. Both architectures prioritize efficient indexing and retrieval, enabling agents to access and utilize relevant information more effectively than traditional key-value stores, and facilitating dynamic memory allocation and garbage collection to manage resources and prevent memory bloat.
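The semantic-similarity retrieval described above reduces, at its core, to ranking stored embeddings by cosine similarity to a query vector. The toy three-dimensional "embeddings" below are made up for illustration; a real system would obtain them from an encoder model and use a vector database for indexing.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Toy embeddings: in practice these come from an encoder, not by hand.
store = {
    "the capital of France is Paris": [0.9, 0.1, 0.0],
    "water boils at 100 C":           [0.1, 0.9, 0.1],
}


def retrieve(query_vec, store, k=1):
    # Rank memory entries by semantic closeness to the query embedding.
    ranked = sorted(store, key=lambda text: cosine(query_vec, store[text]),
                    reverse=True)
    return ranked[:k]


hits = retrieve([0.85, 0.15, 0.05], store)
```

Unlike keyword matching, the query never has to share words with the stored text; only the embeddings need to be close.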

Differentiable Read-Write Controllers represent a refinement of memory access mechanisms in agent architectures, moving beyond simple retrieval to prioritize information based on learned relevance. These controllers utilize differentiable functions to assign weights to memory entries, effectively modulating the strength of read and write operations. This allows the agent to focus computational resources on the most pertinent data, improving efficiency and performance. The “differentiable” aspect is critical, as it enables the agent to learn these weighting functions through gradient descent, optimizing memory access patterns based on experience and task demands. Consequently, the agent can dynamically adapt its memory focus, enhancing its ability to process information and make informed decisions.
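The softmax-weighted read at the heart of such controllers can be shown in miniature. This is a generic attention-style memory read, a sketch of the general technique rather than any specific controller from the paper; in a trained system the query and keys would be learned.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attention_read(query, keys, values):
    """Soft, differentiable read: a weighted blend of memory slots
    instead of a hard lookup, so gradients can flow through the weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return read, weights


keys = [[1.0, 0.0], [0.0, 1.0]]       # one key per memory slot
values = [[10.0, 0.0], [0.0, 10.0]]   # the content stored in each slot
read, weights = attention_read([2.0, 0.0], keys, values)
```

Because the read is a smooth function of the query, gradient descent can adjust how strongly each slot is attended to, which is what makes the access pattern learnable.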

ReMem consistently achieves a higher cumulative success rate than the History baseline across diverse interactive agent tasks, including ALFWorld, BabyAI, PDDL, and ScienceWorld, as demonstrated by the rolling average performance trend.


From Static Response to Dynamic Improvement: The Power of Reflection

Test-time learning enables autonomous agents to modify their behavior and enhance performance after initial training, while actively interacting with an environment. This adaptation is crucial for addressing distribution shifts – discrepancies between training and deployment data – and handling unforeseen circumstances not encountered during training. Unlike traditional static models, test-time learning allows agents to continually refine their reasoning and problem-solving capabilities based on real-world experience. This is achieved by leveraging incoming data to update internal parameters or strategies, effectively allowing the agent to learn and improve throughout its operational lifespan, rather than remaining fixed to its initial programming.

Agent-based test-time learning frameworks, such as Reflexion and Voyager, incorporate a reflection mechanism that enables agents to analyze their past actions and outcomes to identify performance deficiencies. This process typically involves the agent evaluating its completed tasks, pinpointing errors or suboptimal strategies, and generating corrective actions or revised plans. The reflection stage isn’t simply error detection; it involves reasoning about why a failure occurred, allowing the agent to generalize lessons learned and improve its future performance on similar or novel tasks. The output of this reflection process is then used to modify the agent’s behavior, often through adjustments to its internal reasoning processes or prompting strategies, effectively guiding adaptation during deployment without requiring explicit retraining.
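A Reflexion-style retry loop can be sketched compactly. Here `attempt` and `reflect` are toy stand-ins for LLM calls (the real framework prompts a model to act and to critique its own trace); the names and the "check units" scenario are invented for illustration.

```python
def run_with_reflection(task, attempt, reflect, max_trials=3):
    """Retry a task, feeding verbal self-critiques back into each new attempt."""
    reflections = []
    for _ in range(max_trials):
        success, trace = attempt(task, reflections)
        if success:
            return True, reflections
        # On failure, reason about *why* the trace failed and keep the lesson.
        reflections.append(reflect(task, trace))
    return False, reflections


# Toy stand-ins: the "agent" succeeds once its reflections mention the key step.
def attempt(task, reflections):
    solved = any("check units" in r for r in reflections)
    return solved, "forgot to check units"


def reflect(task, trace):
    return f"Lesson: {trace} -> next time, check units first."


ok, lessons = run_with_reflection("convert 5 km to miles", attempt, reflect)
```

The key point matches the text: adaptation happens through accumulated verbal lessons fed back into the prompt, not through retraining any weights.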

Experience reuse in agent-based systems is the process by which previously encountered reasoning pathways and solutions are generalized and applied to novel, previously unseen tasks. This differs from rote memorization by focusing on the underlying strategies employed rather than specific input-output pairings. Effective experience reuse requires the agent to abstract key principles from completed tasks, allowing it to identify analogous situations and adapt its problem-solving approach accordingly. This capability is crucial for addressing distribution shifts and achieving robust performance in dynamic environments where the agent encounters situations not explicitly covered in its initial training data. The successful implementation of experience reuse mechanisms directly contributes to an agent’s ability to learn continuously and improve its performance over time without requiring explicit retraining.

Retrieval-based memory systems are a core component of agent-based test-time learning frameworks, facilitating continuous improvement by allowing agents to access and leverage previously encountered experiences. These systems typically store experiences – encompassing states, actions, and resulting outcomes – as embeddings within a vector database. When faced with a new situation, the agent retrieves the most similar past experiences based on embedding similarity, providing contextual information for decision-making. This retrieved information is then used to inform the agent’s reasoning process, allowing it to adapt strategies and improve performance on novel tasks without explicit retraining. The effectiveness of these systems relies on efficient indexing and retrieval algorithms, as well as the quality of the embeddings used to represent experiences.
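The store of (state, action, outcome) experiences can be sketched as follows. The classes and the token-overlap similarity are hypothetical simplifications; a real system would embed each episode and query a vector database, as the paragraph above describes.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    """One stored episode: the situation, what was done, and what happened."""
    state: str
    action: str
    outcome: str


class ExperienceStore:
    """Hypothetical episodic store; token overlap stands in for the
    embedding similarity a real vector database would compute."""

    def __init__(self):
        self.episodes = []

    def add(self, exp: Experience) -> None:
        self.episodes.append(exp)

    def top_k(self, state: str, k: int = 2):
        def sim(e: Experience) -> int:
            return len(set(e.state.split()) & set(state.split()))
        return sorted(self.episodes, key=sim, reverse=True)[:k]


store = ExperienceStore()
store.add(Experience("mug on table", "pick up mug", "success"))
store.add(Experience("door locked", "turn key", "success"))
context = store.top_k("mug on counter", k=1)
```

The retrieved episode ("pick up mug" succeeded in a similar state) is then placed in the agent's context to inform its next decision, with no parameter updates involved.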

The system leverages conversational recall to retrieve factual information, such as solutions to equations, and experience reuse to apply learned reasoning strategies like formulaic approaches.

Evo-Memory: A Benchmark for Self-Evolving Intelligence

A novel framework, termed Evo-Memory, offers a robust and comprehensive benchmark for evaluating the capacity of large language model (LLM) agents to develop and refine their own memory systems. This system goes beyond static memory by allowing agents to adapt and improve their recall and application of information during interactive tasks. Demonstrating the efficacy of this approach, the ReMem agent, tested within the Evo-Memory framework, achieves an impressive success rate of up to 97% on complex, multi-turn interactive challenges. This performance suggests that self-evolving memory architectures hold significant promise for creating more adaptable and effective artificial intelligence, capable of navigating dynamic and unpredictable environments with greater autonomy and efficiency.

The Evo-Memory framework establishes rigorous testing grounds for artificial intelligence through the implementation of Task Streams and multi-turn, goal-oriented environments. These simulated scenarios move beyond simple, isolated challenges, instead demanding agents navigate complex, interactive situations akin to real-world problem-solving. By requiring sustained performance across multiple conversational turns and diverse tasks within a stream, the framework rigorously assesses an agent’s ability to not just respond to prompts, but to learn and adapt its strategies over time. This approach effectively pushes the boundaries of agent intelligence, moving beyond rote memorization towards genuine, evolving competence in dynamic environments, and provides a platform to evaluate how well agents can maintain context and achieve long-term goals.
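Sustained performance over a task stream is naturally summarized by a rolling success rate, the kind of curve shown in the ReMem-vs-History figure above. The function below is a generic sketch of that metric, not code from the benchmark; the toy stream is invented.

```python
def rolling_success(results, window=3):
    """Rolling success rate over a stream of task outcomes (1 = success)."""
    rates = []
    for i in range(len(results)):
        recent = results[max(0, i - window + 1): i + 1]
        rates.append(sum(recent) / len(recent))
    return rates


# An agent that learns at test time should trend upward across the stream.
stream = [0, 0, 1, 1, 1, 1]
curve = rolling_success(stream)
```

A flat curve indicates rote per-task performance; an upward trend is the signature of test-time learning the benchmark is designed to detect.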

Evo-Memory offers a robust platform for evaluating how different memory architectures enable agents to overcome novel obstacles. Researchers leverage this framework to compare systems like Hierarchical Memory, which organizes information in layered structures, against more advanced approaches such as ReMem – a system designed for continuous self-improvement through iterative refinement of its knowledge base. By subjecting these architectures to increasingly complex, multi-turn tasks within simulated environments, Evo-Memory quantifies their ability to not only store information but to dynamically adapt and apply it to unforeseen challenges, revealing critical insights into the development of truly intelligent and resilient agents.

Evaluations within the Evo-Memory framework establish ExpRAG as a crucial performance benchmark, revealing the significant advantages of more advanced memory architectures. Comparative analyses demonstrate that ReMem, a self-evolving memory system, markedly improves task efficiency; specifically, on the ALFWorld interactive environment, ReMem reduces the average number of steps required to successfully complete tasks from 22.6 to just 11.5. This substantial decrease highlights ReMem’s capacity for adaptive learning and optimized problem-solving, suggesting that self-evolving memory represents a considerable advancement over traditional retrieval-augmented generation (RAG) systems and opens avenues for building more efficient and intelligent agents.

Across four benchmarks, ReMem consistently outperforms the other methods (History, ExpRecent, and ExpRAG) by requiring fewer steps to complete tasks, indicating superior efficiency.

Towards a Future of Continuous Adaptation

The pursuit of truly adaptive intelligence is rapidly gaining momentum through the synergistic combination of innovative technologies. Advanced memory architectures, moving beyond traditional storage, now allow agents to retain and organize information with unprecedented efficiency, enabling more nuanced responses to complex stimuli. This is further enhanced by test-time learning, where systems refine their capabilities during operation, rather than solely relying on pre-programmed knowledge. Critically, the addition of self-reflection – the ability to analyze past performance and identify areas for improvement – completes the cycle, fostering continuous learning and optimization. This convergence isn’t merely incremental; it signifies a shift toward systems capable of independent growth and adaptation, promising a future where artificial intelligence doesn’t just respond to challenges, but actively evolves to overcome them.

The trajectory of artificial intelligence is shifting towards agents capable of not just performing tasks, but of mastering them through continuous self-improvement. Recent advancements are enabling systems to move beyond pre-programmed responses and embrace a dynamic learning cycle, where errors become opportunities for refinement. This isn’t simply about increasing processing power; it’s about building architectures that facilitate genuine adaptation. Agents can now analyze their own performance, identify weaknesses, and adjust their strategies accordingly, leading to progressively enhanced skillsets. The result is a capacity to confront increasingly complex challenges – tasks previously insurmountable – and consistently elevate performance levels, blurring the lines between programmed behavior and true intelligence.

The development of adaptive intelligence holds transformative potential across numerous sectors, most notably in robotics, where systems can move beyond pre-programmed routines to navigate unpredictable environments and learn from physical interactions. Automation benefits through the creation of processes that are not simply repetitive, but dynamically adjust to changing conditions and unexpected variables, increasing efficiency and reducing errors. Perhaps most profoundly, personalized assistance stands to be revolutionized, as truly intelligent systems can understand individual user needs, anticipate requirements, and provide tailored support far exceeding current capabilities. This convergence promises a future where technology doesn’t just respond to human input, but proactively learns and evolves alongside users, ultimately ushering in an era of seamless and intuitive interaction.

The development of truly self-evolving agents hinges on continued investigation into frameworks like Evo-Memory, which allow artificial intelligence to not just learn, but to fundamentally restructure its own knowledge base. Recent evaluations of the ReMem architecture reveal a strong correlation between an agent’s ability to improve and the similarity of tasks it encounters, with scores of 0.717 when assessed by Gemini 2.5 Flash and 0.563 using Claude 3.7 Sonnet. This suggests that agents can leverage past experiences more effectively when presented with related challenges. Furthermore, ReMem achieved an exact match score of 0.65 on single-turn reasoning and question answering benchmarks, demonstrating a capacity for sophisticated cognitive processing and indicating the potential for creating AI systems capable of continuous self-optimization and increasingly complex problem-solving.

ReMem's performance improvement over a historical baseline correlates positively with the similarity of tasks within the dataset.

The pursuit of increasingly elaborate agentic memory systems often obscures a fundamental truth: effective intelligence isn’t about accumulating information, but about distilling it. This work, with its Evo-Memory benchmark, acknowledges that continual refinement (the relentless pruning of the irrelevant) is paramount. It’s a quiet rebellion against the bloat of modern frameworks, which, one might observe, are often called upon to hide the panic. As Carl Friedrich Gauss observed, “If I speak without understanding, then it is a shame.” Evo-Memory forces a reckoning with understanding; it demands that LLM agents not merely store experience, but learn from it, shedding the superfluous to reveal the core principles that drive performance in both single and multi-turn tasks.

What Remains?

The presented work establishes a substrate – Evo-Memory – for measuring the plasticity of large language model agents. This is not, however, an arrival. The benchmark itself, while necessary, merely illuminates the vastness of what remains unknown regarding true experiential learning. Current evaluations privilege performance gains, but neglect the cost of adaptation – the energetic efficiency, if you will, of retaining versus discarding information. A truly intelligent system doesn’t simply learn more; it learns better what to forget.

Future inquiry should therefore shift focus. The question isn’t whether an agent can accumulate experience, but whether it can curate it. Investigating the interplay between memory structure and task complexity is paramount. The field assumes a universal memory architecture; yet, specialization – a tiered system where procedural knowledge resides separately from episodic recollection – is likely crucial. To believe otherwise is to equate data storage with understanding.

Ultimately, the pursuit of lifelong intelligence demands a reckoning with simplicity. The tendency to add layers of complexity, to build ever-larger models, should be tempered with a ruthless pruning. The most profound advancements will likely emerge not from what is added, but from what is elegantly removed.


Original article: https://arxiv.org/pdf/2511.20857.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-11-30 13:20