WorkMemEval: Towards A Benchmark for Agentic Memory
The AI industry has several conversational memory benchmarks—LongMemEval, LoCoMo, Hotpot—but these are far too simple to capture the dynamism needed for agentic workflows. To our knowledge there are no benchmarks for the more advanced concept of an “agent working memory”, which is a richer and more demanding functionality than search and retrieval against a static corpus. We have too many passive memory metrics, and too few active memory metrics. This is why we’ve begun work on developing a new benchmark for agentic working memory that we’re calling WorkMemEval.
We plan to open-source it soon, but in this post we wanted to first make the case for why such a benchmark is needed and share the reasoning behind it.
First, we need to make the case for a new benchmark, and explain why agent memory, not just "context engineering," is the right abstraction to treat as its own category.
The Problem: Static vs. Active Memory
Imagine an AI agent refactoring a complex codebase. It must juggle dozens of state variables and scores of interdependent files. Midway through, a bug report arrives about another feature. The agent investigates, realizes it’s irrelevant, and must return to the refactor without losing track of its plan. Its context is now muddied by the distraction: mostly dead weight in the context window, though perhaps a single detail from that bug report is still relevant.
The codebase it's working on is constantly changing as a result of its own efforts and any other external changes, so passive storage and recollection in such a volatile environment is not a viable strategy. This is the main reason why conversational memory metrics don't generalize to agentic memory tasks.
Once the context window fills, information that's not compacted must be re-fetched, and here's where inefficiency and information loss often creep in. Long histories get repeatedly summarized and re-summarized, which is inherently lossy. The risk of catastrophic forgetting rises as compression filters out essential details. Furthermore, new and unexpected tasks or diversions may be introduced in the course of the work, so the agent must maintain focus on the overall plan without losing track of the details. This kind of frothiness shows up in agent workflows far more than in conversational chatbots.
Such a scenario goes to show how irreducibly complex and varied real-world tasks are, and how simple conversational-recall metrics fail to represent these challenges.
We currently have no benchmark to measure how agents manage working memory—tracking task state, reusing context efficiently, recovering from distractions. And that’s a problem, because working memory—not simply recall—is what separates capable agents from unreliable ones.
Why Conversational Memory Isn't Enough
Benchmarks like LongMemEval and LoCoMo treat memory as a static lookup problem: essentially long-form retrieval. This works for chatbots, where information accumulates slowly through human interaction; when inputs arrive at the rate-limited pace of human conversation, memory is far more manageable. But agents operate differently. They’re both user-facing and task-facing: they self-generate vast histories, shift objectives midstream, and must decide what to remember and what to forget while in motion.
Agentic memory isn’t about storing transcripts; it’s about sustaining situational awareness during active problem solving—retrieving what’s relevant, filtering what’s not, and maintaining continuity across complex, evolving workflows. Static recall benchmarks can’t measure that.
Defining Agentic Working Memory
Working memory is a limited-capacity cognitive system responsible for the temporary storage and manipulation of task-relevant information. It emerges where memory, attention, and executive function intersect. As such, working memory is more than the retrieval of past information; it has a more complex, cognitive character.
In this sense, “working memory” is long-term memory in a heightened state of activation, coupled with attentional control and executive processes working with immediate information. This is why we describe our own agentic memory system, HyperBinder, as the world’s first cognitive memory engine—not a passive store, but an active reasoning-over-context mechanism or "relevant context predictor."
Context Windows ≠ Memory
A bigger context window doesn’t solve memory—it just postpones, and sometimes even causes, the problems. As context accumulates, context rot sets in: irrelevant details blur the decision space, old drafts conflict with new ones, and costs balloon as every action requires re-processing a mountain of redundant tokens.
Working memory should be the first-pass filter, deciding what to send into the context window, when, and how. The point isn’t to store more—it’s to store smarter. True agentic memory selectively maintains the minimal, most relevant information necessary for coherent action, and populates the context window accordingly.
Nor is context engineering enough. Context engineering can’t cover every contingency through templates. Memory engineering, by contrast, lets the system itself decide what belongs in context at any moment, perhaps with the assistance of machine-learning algorithms, so that behavior generalizes across unforeseeable contingencies rather than being hard-coded.
The main point is that we don't want to simply measure how well an LLM can track the contents in its context window; we want to measure the overall mechanisms that determine what goes into its context window for any given task.
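To make the object of measurement concrete, here is a minimal Python sketch of the kind of interface we treat as the system under test. The names (MemoryItem, WorkingMemory, observe, select_context) are purely illustrative and not an existing API; the point is that the benchmark scores the selection mechanism, not the LLM behind it.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class MemoryItem:
    """A candidate piece of context: a tool output, file chunk, prior decision, etc."""
    content: str
    source: str
    timestamp: float

class WorkingMemory(Protocol):
    """Hypothetical interface for the system under test: the mechanism that
    decides what enters the context window, not the LLM behind it."""

    def observe(self, item: MemoryItem) -> None:
        """Ingest new information produced during the workflow."""
        ...

    def select_context(self, task_description: str, token_budget: int) -> list[MemoryItem]:
        """Return the minimal working set to place in the context window
        for the current task state, within a token budget."""
        ...
```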
Measuring Contextual Relevance and Coherence
If most context is irrelevant to a given task, then whenever a session history is naively sent to an LLM, most tokens are either wasted or, worse, actively degrade performance. The challenge is to measure task relevance: what minimum context (immediate, recent, and stored long term) is sufficient to complete a task while preserving long-term coherence?
One approach is to rank context chunks against task descriptions using semantic similarity, eliminate the least relevant ones, and measure performance deltas.
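As a rough sketch of that approach, the snippet below ranks chunks with TF-IDF cosine similarity and prunes the least relevant ones. A real harness would likely swap in an embedding model; the function names here are hypothetical and only illustrate the evaluation loop.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_chunks(task_description: str, chunks: list[str]) -> list[tuple[str, float]]:
    """Rank context chunks by similarity to the task description.
    TF-IDF keeps this sketch dependency-light; a real harness would
    likely use a sentence-embedding model instead."""
    matrix = TfidfVectorizer().fit_transform([task_description] + chunks)
    scores = cosine_similarity(matrix[0:1], matrix[1:]).flatten()
    return sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)

def prune_context(task_description: str, chunks: list[str], keep_fraction: float = 0.5) -> list[str]:
    """Keep only the most task-relevant fraction of the candidate context."""
    ranked = rank_chunks(task_description, chunks)
    keep_n = max(1, int(len(ranked) * keep_fraction))
    return [chunk for chunk, _ in ranked[:keep_n]]
```

The performance delta is then the difference in task outcome between a run given the full history and a run given only the pruned set.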
A complementary dimension is longitudinal coherence—how well the agent sustains a multi-step workflow that exceeds any single context window. Here, we’re not evaluating LLMs per se, but agent systems that combine reasoning loops, tool use, and a memory backend.
Benchmarks must therefore feature long, open-ended tasks that stress memory persistence and adaptation over time. Simulating long-form, continuous tasks in an organic way is a serious research challenge.
Monitoring Tool Use
Outcome-based benchmarks hide the process, but for agents, how a problem is solved matters as much as whether it is solved. Tool usage provides natural telemetry and is one of our key hooks into measuring agent behavior.
Frequent file-read calls, for instance, indicate forgetfulness—if the agent keeps re-reading the same resource, it’s not maintaining the necessary context internally. By logging tool calls, we can reconstruct agent behavior and quantify memory efficiency.
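As an illustration of the kind of signal we mean, the following sketch computes a redundant-read ratio from a hypothetical tool-call log; the ToolCall structure and the "read_file" tool name are assumptions, not a fixed schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str       # e.g. "read_file" (hypothetical tool name)
    argument: str   # e.g. the path that was read

def redundant_read_ratio(calls: list[ToolCall], tool_name: str = "read_file") -> float:
    """Fraction of calls to a given tool that re-fetch something already fetched.
    A high ratio suggests the agent is not retaining what it has already read."""
    targets = [c.argument for c in calls if c.name == tool_name]
    if not targets:
        return 0.0
    repeats = sum(n - 1 for n in Counter(targets).values())
    return repeats / len(targets)
```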
Context Switching and Task Stickiness
Working memory should also support focus recovery after interruptions. Agents, like humans, can lose track mid-flow when diverted. We call this task stickiness: how well the agent stays on task or resumes after a distraction.
Benchmarks can measure this by injecting distractors mid-workflow, then quantifying the agent’s ability to “bounce back” without performance loss. In this respect, working memory tracks the ability of an agent to adhere to a plan despite distractions and to snap back to it when dealing with in-the-moment, unplanned task churn.
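One simple way to turn this into a number, sketched below with hypothetical field names, is to run the same scenario with and without an injected distractor and compare the overhead.

```python
from dataclasses import dataclass

@dataclass
class RunTrace:
    success: bool
    steps_to_complete: int  # total agent steps (tool calls + reasoning turns)

def stickiness_score(baseline: RunTrace, distracted: RunTrace) -> float:
    """Compare a run with a mid-workflow distractor against a clean baseline.
    1.0 means the agent finished with no extra overhead after the interruption;
    lower values mean it struggled to snap back; 0.0 means it never recovered."""
    if not (baseline.success and distracted.success) or distracted.steps_to_complete == 0:
        return 0.0
    return min(1.0, baseline.steps_to_complete / distracted.steps_to_complete)
```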
Task-Critical Information Loss (TCIL)
Most current agents rely on summarization to compress history. We believe this is both lossy and unnecessary. Summaries inevitably discard details crucial to later reasoning, and as contexts grow, the probability of information loss increases.
A proper cognitive memory engine makes summarization obsolete. Instead of compacting entire histories, it continuously curates what’s relevant and supplies an optimized working set of context to the model—maintaining task salience while minimizing token use.
TCIL, the degree of irrecoverable context loss during summarization, is one of the key failure modes WorkMemEval aims to quantify.
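One way TCIL could be operationalized, assuming each scenario seeds facts that are known to be needed later, is to probe for those facts after the agent's history has been compacted and score what was lost. This is a sketch, not the final scoring protocol.

```python
def tcil(critical_facts: set[str], recalled_facts: set[str]) -> float:
    """Task-Critical Information Loss: the fraction of facts seeded earlier in
    the workflow that the agent can no longer surface after its history has
    been compacted (summarized, truncated, or otherwise compressed)."""
    if not critical_facts:
        return 0.0
    return len(critical_facts - recalled_facts) / len(critical_facts)
```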
Toward a Taxonomy of Agentic Memory
The preceding sections describe the phenomenology of agentic memory—what it feels like in practice when agents remember well or poorly. But to benchmark it, we need a more systematic view: a set of measurable dimensions that together define what “good memory” means for an agent.
We propose four complementary dimensions for evaluating agentic working memory:
- Retention Efficiency – How effectively does the agent retain and reuse task-relevant information over time? Measured via token savings, reduced redundant tool calls, and consistency in multi-step outputs.
- Contextual Relevance – How precisely does the agent filter and prioritize context for the current task state? Measured via semantic relevance ranking, precision/recall over contextual cues, and performance deltas after pruning.
- Continuity & Coherence – How well does the agent preserve logical and causal continuity across subtasks, even when context is truncated or reloaded? Measured via workflow completion rates, dependency preservation, and error propagation analysis.
- Resilience & Focus – How gracefully does the agent handle interruptions, branching tasks, or distractions? Measured via recovery latency, “stickiness” after context switches, and downstream task accuracy.
These four axes form the conceptual core of WorkMemEval.
They translate the theory of cognitive working memory—attention, maintenance, and manipulation—into agentic terms. Together, they define what a cognitive memory architecture should be optimized for: not simply storing facts, but sustaining reasoning continuity through dynamic adaptation.
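To make the taxonomy concrete, a per-run result might be reported along these axes roughly as follows; this is a hypothetical schema, not the final WorkMemEval format.

```python
from dataclasses import dataclass

@dataclass
class WorkMemScore:
    """Hypothetical per-run report card along the four proposed axes,
    each normalized to the range [0, 1]."""
    retention_efficiency: float   # token savings, avoided redundant tool calls
    contextual_relevance: float   # precision/recall over contextual cues
    continuity_coherence: float   # workflow completion, dependency preservation
    resilience_focus: float       # recovery after distractors

    def overall(self, weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
        """Weighted aggregate score; equal weights by default."""
        axes = (self.retention_efficiency, self.contextual_relevance,
                self.continuity_coherence, self.resilience_focus)
        return sum(w * a for w, a in zip(weights, axes))
```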
What Comes Next
Without working-memory benchmarks, we’re building agentic systems in the dark. We can’t distinguish agents that maintain context from those that simply re-read the same files. We can’t measure whether memory architectures reduce cognitive load or just shift it. We can’t evaluate how well agents recover from interruptions or maintain state across long horizons.
This gap has real costs. Organizations may be burning thousands of dollars in redundant token usage. Researchers may be comparing memory systems without shared criteria. The field is optimizing for the wrong metrics.
WorkMemEval aims to change that. It measures what matters: how agents maintain and manipulate task-relevant information during active problem solving. The framework includes:
- Realistic task environments with quantifiable complexity
- Passive evaluation that doesn’t require instrumentation
- Multi-dimensional metrics (efficiency, accuracy, resilience)
- Reproducible scenarios with adjustable difficulty
We’ll open-source the full benchmark—tasks, evaluation harness, and baselines—so others can extend and validate it.
If you’re building memory architectures, designing agent frameworks, or simply frustrated by current evaluation gaps, we’d love to collaborate. WorkMemEval will be stronger with diverse perspectives.
Keep an eye out for the upcoming Git repo and research paper as we continue to develop the benchmark for agentic working memory.