Agentic Context Management: Why the Model Should Manage Its Own Context
Introduction
Most agentic tools today conflate the conversation with the context. Every message sent, every tool result returned, every file read, every failed attempt: all of it accumulates in a single, undifferentiated stream that is passed wholesale to the model on every turn. The implicit assumption is that the conversation is the context, and that nothing can be done about this except to wait until the window fills and then compact everything at once.
This assumption is wrong. It doesn’t need to be this way.
The Problem No One Is Solving Correctly
The effective context window of large language models remains the primary bottleneck for complex, long-horizon agentic tasks. Even models with 1M+ token windows are insufficient for multi-hour agentic sessions, where the volume of tool calls, file contents, and intermediate reasoning can exceed the context limit of any production LLM. This problem is compounded by “context rot,” in which model performance degrades well before the nominal limit is reached, as the window fills with stale exploration, failed attempts, and raw data that has already been consumed.
The critical observation, and the one most existing work fails to fully appreciate, is that the context problem in agentic systems is not an input problem. A user does not arrive with 10 million tokens of data and ask the model to process it. A user arrives with a short request: “fix the auth bug,” “refactor the payment module,” “investigate why latency spiked.” The context problem emerges during the work itself. The model reads files, runs commands, explores dead ends, backtracks, tries new approaches. Over the course of a session, it generates hundreds of thousands of tokens of working context, the vast majority of which becomes irrelevant the moment it has been used.
This is fundamentally a working memory management problem, not a data processing problem. The challenge is maintaining a sharp, relevant context as the model works through a multi-step task over an extended conversation. Two recent papers have proposed architectures for active context management: Recursive Language Models (RLM) [1] and Lossless Context Management (LCM) [2]. Both contain valuable insights, but neither fully addresses the conversational working-memory problem that dominates real-world agentic use cases.
Background: The Shift Toward Active Context Management
Traditional approaches to context management are passive. Sliding windows truncate older messages. Compaction systems summarize the entire conversation when a token threshold is crossed. These approaches treat context as a queue: new content enters, old content exits, and no one asks whether the old content might be more important than the new.
The insight that context management should be an active process, one in which the system deliberately decides what to keep, what to compress, and what to discard, represents a genuine paradigm shift. Rather than passively accepting whatever happens to be in the window, an active system curates its context to maximize the relevance and utility of every token.
This insight is shared by RLM, LCM, and Agentic Context Management (ACM), the system presented in this essay. Where they diverge is in who performs the curation, when it happens, and what mechanisms are available.
Recursive Language Models: Full Autonomy, Wrong Problem
How RLM Works
Recursive Language Models [1], proposed by Zhang, Kraska, and Khattab (2026), take the most radical position in the design space. An RLM treats the user’s prompt not as something that enters the model’s context window, but as a variable in a Python REPL environment. The model never “sees” the input directly. Instead, it receives metadata about the prompt (its length, a short prefix, how to access it) and writes Python code to inspect, decompose, and process it.
Each iteration of the RLM loop follows a simple cycle: the model generates code, the code executes in the REPL, stdout is captured, and metadata about the output is appended to a short history. Crucially, the model can call llm.query(snippet) within its code, spawning recursive sub-calls on slices of the input. When the model is finished, it sets a Final variable in the REPL, and the loop terminates.
A typical RLM trajectory might look like this:
# Peek at the data
print(prompt[:200])
# Devise a chunking strategy
parts = prompt.split("Chapter")
results = []
for part in parts:
    results.append(llm.query(f"Summarize: {part}"))
# Combine results
Final = llm.query(f"Combine these summaries: {results}")

This is “symbolic recursion”: the model writes programs that invoke the LLM itself on programmatically constructed transformations of the input. The prompt lives in the environment as a variable and is manipulated through code, never through direct context ingestion.
The Strengths of RLM
RLM’s results are impressive on its chosen benchmarks. On tasks like S-NIAH, OOLONG, and BrowseComp, RLMs handle inputs up to two orders of magnitude beyond model context windows, substantially outperforming both vanilla frontier models and common long-context scaffolds. The architecture demonstrates that recursive self-invocation is a powerful primitive for processing large static inputs.
The Fundamental Mismatch
Despite these results, RLM is solving the wrong problem for the agentic context management use case.
There is no conversation. RLM has no conversation thread. There is no persistent back-and-forth between user and model. The architecture is designed for a single-shot pattern: receive a massive input, process it, return a result. This bears no resemblance to how people actually use coding agents, research assistants, or any long-running agentic tool.
The context problem is assumed to be an input problem. RLM’s entire architecture is predicated on the assumption that the challenge is a prompt too large to fit in context. In real agentic sessions, the challenge is the opposite: the prompt starts small and the model’s own activity fills the window with stale exploration. RLM has no mechanism for managing this self-generated context because it has no mechanism for ongoing interaction at all.
The model must be a competent programmer on every iteration. Each RLM invocation requires the model to devise a correct chunking strategy, write working Python to implement it, handle edge cases, write correct sub-call prompts, and aggregate results. All of this must happen correctly every time, from scratch, because there is no reusable structure between invocations. The RLM paper [1] itself notes that different models make fundamentally different strategic choices (GPT-5 is conservative with sub-calls while Qwen3-Coder launches one per line), suggesting the “right” strategy is highly model-dependent and fragile. This is asking the model to be simultaneously good at understanding the task, software engineering, and meta-cognitive strategy, on every single iteration.
The benchmarks are synthetic. S-NIAH (needle in a haystack), OOLONG (chunk transformation and aggregation), and BrowseComp (multi-hop QA over document corpora) are all “process this giant input” tasks. They stress-test context windows, not conversational working memory. They do not reflect the dominant use case for agentic AI: extended, interactive sessions where context accumulates organically through collaborative work.
RLM is an academically interesting demonstration that LLMs can write programs to process their own inputs. It is not a solution to the context management problem that matters most in practice.
Lossless Context Management: Right Problem, Wrong Locus of Control
How LCM Works
Lossless Context Management [2], proposed by Ehrlich and Blackman (2026), takes the opposite position from RLM. Where RLM gives the model full autonomy, LCM shifts the burden of context management entirely to the engine. The model operates within a conventional conversation thread, but the engine deterministically manages what appears in the active context window.
LCM’s architecture rests on two pillars: the Immutable Store, which persists every message verbatim and is never modified; and the Active Context, the window actually sent to the LLM on each turn, assembled from a mix of recent messages and precomputed summary nodes.
The core data structure is a hierarchical Directed Acyclic Graph (DAG) maintained in a persistent store (their reference implementation uses embedded PostgreSQL). As the active context fills, older messages are compacted into Summary Nodes: compressed representations derived from the originals via LLM summarization. Summary nodes can themselves be summarized, forming a multi-resolution hierarchy. The original messages remain in the immutable store and can be recovered via lcm_expand, which reverses the compaction.
The engine manages compaction through a deterministic control loop with soft and hard token thresholds. Below the soft threshold, nothing happens: the store acts as a passive logger and the user experiences raw model latency. When the soft threshold is exceeded, the engine triggers asynchronous compaction between turns. If the hard threshold is reached, the engine blocks and compacts the oldest block in the active context until the window is within budget.
To guarantee convergence (avoiding the case where a summary is longer than its source), LCM employs a three-level escalation protocol: first, a detail-preserving summary; if that fails to reduce tokens, a bullet-point summary at half the target size; and if that also fails, a deterministic truncation that requires no LLM call at all.
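The escalation logic can be sketched in a few lines. This is an illustrative reconstruction from the description above, not LCM's actual code: the `Summarizer` callback stands in for the LLM summarization call, and the 4-characters-per-token estimate is an assumption, not LCM's tokenizer.

```typescript
// Sketch of LCM's three-level compaction escalation (illustrative, not the
// paper's implementation). `summarize` stands in for an LLM call.
type Summarizer = (
  text: string,
  style: "detailed" | "bullets",
  targetTokens: number
) => string;

// Crude token estimate: ~4 characters per token (an assumption).
const countTokens = (text: string): number => Math.ceil(text.length / 4);

function compactBlock(source: string, targetTokens: number, summarize: Summarizer): string {
  // Level 1: detail-preserving summary.
  const detailed = summarize(source, "detailed", targetTokens);
  if (countTokens(detailed) < countTokens(source)) return detailed;

  // Level 2: bullet-point summary at half the target size.
  const bullets = summarize(source, "bullets", Math.floor(targetTokens / 2));
  if (countTokens(bullets) < countTokens(source)) return bullets;

  // Level 3: deterministic truncation — no LLM call, guaranteed to shrink.
  return source.slice(0, targetTokens * 4) + " …[truncated]";
}
```

The third level is what makes the guarantee unconditional: even an adversarially verbose summarizer cannot prevent the context from shrinking.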
The Strengths of LCM
LCM correctly identifies the conversational working-memory problem as the one worth solving. Its evaluation benchmarks against Claude Code on realistic coding tasks, not synthetic data-processing exercises. Several of its design choices are genuinely valuable:
Zero-cost continuity. Below the soft compaction threshold, LCM adds zero overhead. For short tasks that fit comfortably in the context window, the system is invisible. This is a practical advantage over architectures that impose setup costs on every interaction.
Guaranteed convergence. The three-level escalation protocol ensures that compaction always reduces context size. This eliminates the failure mode where a summarization attempt produces output longer than its input, a real risk with LLM-based summarization.
Lossless retrievability. Because the immutable store retains every original message, any summary can be expanded back to its source material. The model sees summary text annotated with stable identifiers that it can expand on demand, without needing to know that compaction occurred.
Where LCM Falls Short
The engine lacks semantic judgment. LCM’s compaction policy is simple: compact the oldest block. This is a temporal heuristic, not a semantic one. The engine has no way to know that the architecture decision from turn 5 is more important than the file dump from turn 50. It cannot distinguish a dead-end exploration from a critical insight. It cannot recognize that a particular test output is the key to understanding the bug the model has been chasing for 30 turns.
This is not a minor limitation. In real agentic sessions, the importance of context is highly non-uniform and not correlated with recency. Critical decisions, key constraints, and user requirements often appear early in the conversation, while recent turns may be filled with routine file reads or failed attempts. A policy that compacts the oldest content first will systematically destroy the most important context.
The model has no voice in what gets compacted or when. The model does influence what is preserved through the summaries it generates during compaction. But it has no control over when compaction is triggered or which messages are selected for it, and no ability to protect specific content from being compacted at all. It cannot say “this was a dead end, compact it now” or “this decision is critical, never touch it.” The timing and targeting of compaction are entirely engine-driven.
This is a fundamental philosophical choice, and I believe it is the wrong one. The model is the only entity in the system that understands the task, the user’s intent, and which information is load-bearing. Excluding it from memory management decisions means the system must rely on heuristics that are necessarily inferior to informed judgment.
Parallelization is an independent concern. LCM bundles two distinct capabilities: context management (the DAG, immutable store, compaction thresholds) and parallel data processing (llm_map, agentic_map). The parallel processing operators are genuinely useful, but they solve a different problem: batch processing of large datasets without polluting context. This is an execution concern, not a memory management concern. Any agent framework can implement map-reduce-style parallelism independently of its context management strategy. Conflating the two obscures the actual contribution of each.
The infrastructure cost is significant. LCM requires embedded PostgreSQL with transactional writes, referential integrity, and full-text search. The DAG must be maintained with parent pointers, provenance tracking, and atomic compaction operations. This is a substantial amount of infrastructure for what is ultimately a context curation problem, and much of it exists to compensate for the fact that the engine is making decisions without semantic understanding of the content.
Agentic Context Management: The Case for Informed Autonomy
I present Agentic Context Management (ACM), an alternative architecture that occupies a distinct position in the design space: one in which the model itself actively manages its context using purpose-built tools, while the system provides an immutable conversation log that ensures nothing is ever lost.
Core Architecture
The system separates two concerns that are conflated in traditional architectures:
The Conversation Log: an immutable, append-only record of every message and directive in the session. Entries are frozen at write time and assigned sequential IDs (msg-001, msg-002, etc.). The log is persisted as newline-delimited JSON for durability and session resumption. It is never modified after write.

The Context View: a computed projection of the log, assembled fresh on every turn by applying the accumulated set of context management directives. The view determines what the model actually sees. It is ephemeral, derived, and can change between any two turns without altering the underlying history.
This separation mirrors the distinction between an immutable event log and a materialized view in database systems. The log is the source of truth; the view is optimized for the current query (in this case, the model’s next turn).
The Model’s Tools
The model manages its context through a single tool with four operations:
remove_context: Remove specified messages from the context view. The model can optionally provide a detailed summary that replaces the removed messages, and a label for later retrieval. The original messages remain in the log, indexed by label.
pin: Mark messages as undeletable. Pinned messages bypass all removal mechanisms, including summary replacement and cascade removal. This allows the model to protect information it knows will be needed throughout the session: user requirements, architectural decisions, key constraints.
unpin: Remove protection from pinned messages, making them eligible for removal when they are no longer needed.
retrieve_context: Pull previously removed content back into the active view, either by label (recovering an entire batch of related messages) or by specific message IDs. This enables the model to re-examine raw data that it previously compressed, if it discovers the summary was insufficient.
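The four operations can be sketched as a discriminated union plus a dispatcher that converts each tool call into a stored directive. Field names here are illustrative, not the system's exact schema; the key property shown is that pinned messages bypass removal and that directives accumulate while the log itself is never touched.

```typescript
// Illustrative sketch of the single context-management tool (hypothetical
// field names). Directives accumulate; the underlying log is never mutated.
type ManageContextCall =
  | { operation: "remove_context"; message_ids: string[]; summary?: string; label?: string }
  | { operation: "pin"; message_ids: string[] }
  | { operation: "unpin"; message_ids: string[] }
  | { operation: "retrieve_context"; label?: string; message_ids?: string[] };

interface SessionState {
  directives: ManageContextCall[];
  pinned: Set<string>;
}

function applyToolCall(state: SessionState, call: ManageContextCall): string {
  switch (call.operation) {
    case "pin":
      call.message_ids.forEach((id) => state.pinned.add(id));
      break;
    case "unpin":
      call.message_ids.forEach((id) => state.pinned.delete(id));
      break;
    case "remove_context":
      // Pinned messages bypass all removal mechanisms.
      call.message_ids = call.message_ids.filter((id) => !state.pinned.has(id));
      break;
    case "retrieve_context":
      break; // resolved when the next view is built
  }
  state.directives.push(call);
  return "ok";
}
```

A typical sequence: pin the user's constraints early, remove a batch of failed debugging attempts with a summary labeled "auth-dead-ends", and later, if the summary proves too thin, retrieve that label to restore the originals.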
How the Context View Is Built
On every turn, the context builder deterministically transforms the log into a view by applying all accumulated directives: filtering out removed and summarized messages, inserting summary text at the appropriate positions, cascading removals across tool-call pairs, and injecting any retrieved content. Each surviving message is prepended with its ID so the model can reference it in future management operations. The result is a clean message array ready for the API, shaped entirely by the model’s own curation decisions.
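The builder can be sketched as a pure function that folds the directive list over the log: a minimal illustration of the log-as-source-of-truth, view-as-projection design, with simplified types and illustrative names rather than the actual implementation.

```typescript
// Minimal sketch of the per-turn view builder: the log is never mutated;
// the view is recomputed by applying accumulated directives in order.
interface LogEntry { id: string; role: string; content: string }

type Directive =
  | { op: "remove"; ids: string[]; summary?: string }
  | { op: "retrieve"; ids: string[] };

function buildView(log: LogEntry[], directives: Directive[]): string[] {
  // id -> replacement summary (attached to the first removed id in a batch)
  const removed = new Map<string, string | undefined>();
  for (const d of directives) {
    if (d.op === "remove") {
      d.ids.forEach((id, i) => removed.set(id, i === 0 ? d.summary : undefined));
    } else {
      d.ids.forEach((id) => removed.delete(id)); // retrieval restores visibility
    }
  }

  const view: string[] = [];
  for (const entry of log) {
    if (removed.has(entry.id)) {
      const summary = removed.get(entry.id);
      if (summary) view.push(`[summary of removed messages] ${summary}`);
      continue; // hidden from the view, still present in the log
    }
    // Surviving messages carry their ID so the model can reference them
    // in later management operations.
    view.push(`[${entry.id}] ${entry.role}: ${entry.content}`);
  }
  return view;
}
```

Because the function is deterministic, replaying the same log and directives always yields the same view, which is what makes session resumption from the newline-delimited JSON log straightforward.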
Dynamic Awareness
The model receives real-time statistics about its context on every turn:
### Current Context Stats
- Messages in log: 247
- Visible in context: 89
- Removed: 158
- Pinned: 12
- Tokens: 87,000 / 200,000 (43%)

These stats give the model the information it needs to make informed management decisions: how full the context is, how much has been removed, what labels are available for retrieval, and which messages are protected. The model is explicitly instructed to manage context when utilization exceeds approximately 30%, or when it has clearly finished a task and no longer needs the raw data.
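Rendering the stats block is mechanical; a possible sketch follows, with an assumed ~4-characters-per-token counter standing in for the provider's real tokenizer and the 30% threshold taken from the instruction above.

```typescript
// Sketch of the dynamic stats block (illustrative field names). A real
// implementation would use the provider's tokenizer, not an estimate.
interface ContextStats {
  logged: number;
  visible: number;
  removed: number;
  pinned: number;
  tokens: number;
  budget: number;
}

function renderStats(s: ContextStats): string {
  const pct = Math.floor((s.tokens / s.budget) * 100);
  return [
    "### Current Context Stats",
    `- Messages in log: ${s.logged}`,
    `- Visible in context: ${s.visible}`,
    `- Removed: ${s.removed}`,
    `- Pinned: ${s.pinned}`,
    `- Tokens: ${s.tokens.toLocaleString("en-US")} / ${s.budget.toLocaleString("en-US")} (${pct}%)`,
  ].join("\n");
}

// Nudge threshold: manage context past ~30% utilization.
const shouldManage = (s: ContextStats): boolean => s.tokens / s.budget > 0.3;
```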
Compatibility with Prompt Caching
A natural concern with any active context management system is prompt cache invalidation. Modern LLM providers cache the prefix of the conversation to avoid reprocessing unchanged tokens on subsequent turns. If the context view mutates on every turn, the cache is busted and every turn pays full prefill cost.
ACM is designed to be cache-friendly. The context view only changes when the model explicitly issues a directive; on turns where no context management occurs, the view is identical to the previous turn plus the new messages appended at the end. This means the vast majority of turns preserve the cached prefix entirely. The dynamic stats block is injected at the tail of the context (appended to the latest user message), so it does not mutate the conversation history that precedes it.
When a directive does fire (a remove, pin, or retrieve), the cache is partially or fully invalidated for that turn. But this is infrequent by design: the model manages context at natural breakpoints (after completing a subtask, after consuming a large file), not on every turn. In practice, the cache hit rate remains high because management operations are sparse relative to the total number of turns in a session.
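The cache-preservation property described above reduces to a simple invariant: on a turn with no directives, the new view must equal the old view with messages appended, so every previously cached line survives as a prefix. A minimal check, under the simplifying assumption that views are arrays of rendered lines:

```typescript
// Invariant check for cache-friendliness: an append-only turn keeps the old
// view as an exact prefix of the new one; a directive-firing turn breaks it.
function isPrefixPreserved(prevView: string[], nextView: string[]): boolean {
  if (nextView.length < prevView.length) return false;
  return prevView.every((line, i) => line === nextView[i]);
}
```

When this predicate holds between consecutive turns, the provider's prompt cache remains valid up to the end of the previous view; when a remove, pin, or retrieve rewrites earlier lines, it fails, and that turn pays the prefill cost.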
Why This Architecture Is Superior
The model has semantic judgment. Only the model knows what matters. It has been present for every turn of the conversation. It understands the user’s goal, the current approach, what has been tried and failed, and what information is load-bearing. When it removes context, it can make informed choices: discard the three failed debugging attempts, keep the test output that revealed the root cause, pin the user’s constraint that the fix must be backward-compatible.
The tools are structured, not arbitrary. Unlike RLM, the model does not write ad-hoc Python to manage its context. It uses four well-defined operations with clear semantics. This eliminates the failure mode of writing incorrect chunking code, handling edge cases wrong, or producing malformed sub-call prompts. The model makes decisions (what to remove, what to keep, what to summarize); the system handles execution (applying directives, building views, maintaining the log).
This occupies a productive middle ground in the design space. The model retains semantic judgment (which RLM also has but LCM discards), while the operations are deterministic and well-defined (which LCM also has but RLM discards). ACM gets the best of both.
Pinning is uniquely expressive. Neither RLM nor LCM has an equivalent. Pinning allows the model to declare “this information is important for the entire duration of this session.” User requirements, key architectural decisions, critical constraints: these should never be compacted, regardless of when they appeared. LCM’s engine may compact any message based on its position in the context, with no mechanism for the model to protect specific content. Pinning encodes that protection directly.
Retrieval is a safety net for the model’s own decisions. Both ACM and LCM support retrieval from an immutable store. LCM provides lcm_expand and lcm_grep for recovering compacted content. ACM provides retrieve_context by label or specific message ID. The mechanism is similar, but the relationship between retrieval and the rest of the system differs. In LCM, the model retrieves content to undo decisions the engine made on its behalf. In ACM, the model retrieves content to correct its own judgment. This means the model has a complete feedback loop: it decides what to remove, observes the consequences, and recovers if the decision was wrong. The model can work in a “progressive summarization” style, removing raw data with a detailed summary and retrieving the originals if the summary proves insufficient, all driven by its own assessment of what it needs.
The architecture is minimal. The entire system consists of an append-only log, a directive processor, a context builder, and four tool operations. There is no PostgreSQL, no DAG, no transactional compaction, no worker pools, no three-level escalation protocol. The implementation is a few hundred lines of TypeScript.
This simplicity is not a limitation; it reflects the fact that most of the “infrastructure” in LCM exists to compensate for the engine’s lack of semantic understanding. When the model is making the decisions, the system needs only to provide clean primitives and an immutable record. The complexity lives in the model’s judgment, which improves with every generation of frontier models at no additional engineering cost.
The conversational use case is primary. This system is built for the dominant agentic use case: extended, interactive sessions where context accumulates through collaborative work. It does not attempt to solve batch data processing (LCM’s llm_map) or massive-input decomposition (RLM’s recursive sub-calls). These are valid problems, but they are independent concerns that can be addressed with orthogonal mechanisms (sub-agents, map-reduce operators) layered on top of any context management strategy.
The Bet on Model Capability
This architecture makes an explicit bet: that models are good at judging what is important in their own working context, and that this capability will only improve.
The alternative bet, made by LCM, is that models are not reliable enough for this meta-task, and that deterministic engine heuristics are preferable. This may have been a reasonable position in 2024, but it becomes less defensible with each capability improvement. Engine heuristics do not improve over time; they are static. Model judgment improves with every training run.
A Note on Evaluation
ACM does not yet include benchmark results comparable to those presented in the RLM and LCM papers. This is a practical constraint, not a philosophical one. Running a benchmark suite like SWE-bench Verified or OOLONG at the scale needed for meaningful comparison requires hundreds of frontier-model invocations per configuration. With Opus 4.6 as the target model, this cost is prohibitive for independent research. I consider formal evaluation essential future work and welcome collaboration from teams with the compute budget to run it.
Positioning in the Design Space
The three architectures can be understood as points on a spectrum of model autonomy: LCM sits at one end, where the engine decides everything deterministically; RLM sits at the other, where the model both decides and executes everything through ad-hoc code; ACM sits between them, where the model makes the decisions and the system executes them.
The LCM paper [2] draws an analogy to programming language design: RLM is like GOTO (unrestricted control flow, maximum power, hard to reason about), while LCM is like structured programming (constrained primitives, easier to reason about, less expressive).
This analogy is useful but incomplete. It frames the choice as a trade-off between power and safety, as though the designer must pick one. ACM rejects this framing. The model retains full decision-making authority over its own working memory, which is the power that matters, while operating through four deterministic operations that eliminate the implementation variance that makes RLM fragile. The constraint is not on what the model can decide, but on how those decisions are executed. This is not a compromise between RLM and LCM; it is a different axis entirely: structured execution of model decisions.
Conclusion
The context management problem for long-running agentic sessions is, at its core, a problem of working memory curation. The model generates vast quantities of context as it works, and the challenge is maintaining a relevant, useful context window over hours of interactive exploration.
RLM solves a different problem entirely (processing massive static inputs) and has no conversation to manage. LCM solves the right problem but delegates the critical decisions to an engine that lacks the semantic understanding to make them well.
ACM gives the decisions to the only entity that can make them well: the model itself. It provides structured, deterministic tools that eliminate the variance of ad-hoc code generation. It maintains an immutable log that makes every decision reversible. And it bets on model capability, the one variable in the system that reliably improves over time.
The context window is the model’s working memory. The model should manage it.
References
[1] Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. arXiv preprint arXiv:2512.24601.
[2] Ehrlich, C. & Blackman, T. (2026). LCM: Lossless Context Management. Voltropy PBC. arXiv preprint arXiv:submit/7269166.