The Dirty Secret of Million-Token Context Windows
Llama 4 claims ten million tokens of context. GPT-5.2 advertises 400,000. Anthropic’s Claude Sonnet 4.5 offers one million in beta. The pitch is seductive: throw your entire codebase into the prompt and let the model figure it out. I have been doing this for months, and I can report that it mostly does not work. Yet.
The term “context rot” emerged on Hacker News in mid-2025 to describe what practitioners had been quietly noticing: LLMs get dumber as you give them more context, even when the task stays the same. This is not about harder problems requiring more text. The models genuinely degrade on trivially simple tasks as input length grows, and the research now confirms it. The good news is that we increasingly understand why, and the fixes are coming.
The Numbers Are Worse Than You Think (For Now)
Chroma tested 18 frontier models in July 2025, including GPT-4.1, Claude 4, and Gemini 2.5, on controlled retrieval tasks [1]. Performance dropped 20-50% between 10K and 100K tokens on basic needle-in-a-haystack (NIAH) variants. When the task required even minimal inference (knowing that the Semper Opera House is in Dresden), the degradation was worse.
A Stanford study found that with just 4,000 tokens of context, accuracy drops from 75% to 55% based purely on where you place the relevant information [2]. Put the key fact at the beginning: 75%. Put it in the middle: 55%. The information is identical; only the position changed. This is not at the edge of any context window. This is nothing.
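Placement sweeps like this are easy to reproduce against any chat model. The harness below is a hypothetical sketch (the key fact and distractor passages are invented for illustration): build the same prompt with the fact at different relative positions, send each variant to a model, and compare answers.

```python
def build_probe(fact: str, filler: list[str], position: float) -> str:
    """Place `fact` at a relative `position` (0.0 = start, 1.0 = end)
    among distractor passages, to sweep for lost-in-the-middle effects."""
    docs = list(filler)
    docs.insert(round(position * len(docs)), fact)
    return "\n\n".join(docs)

filler = [f"Distractor passage number {i}." for i in range(20)]
fact = "KEY: the vault code is 7141."
start = build_probe(fact, filler, 0.0)    # fact leads the prompt
middle = build_probe(fact, filler, 0.5)   # fact buried in the middle
```

Ask the model "What is the vault code?" against each variant; by the Stanford result, the `middle` prompt should fail measurably more often, even at trivial lengths.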
The HELMET benchmark evaluated 59 models on real-world tasks and found that needle-in-a-haystack scores, the benchmark everyone uses to show off long-context capabilities, do not predict actual performance [3]. Most models ace NIAH while failing dramatically on tasks requiring synthesis or instruction-following. The gap between open and closed-source models widens as context grows, which actually tells us something important: the frontier labs have techniques that help, and those techniques will eventually make their way to everyone.
Why Models Forget What You Just Told Them
Three factors compound to create this mess, and understanding them points toward solutions.
First, attention dilution. The softmax function forces attention weights to sum to 1, so as context grows, each token’s share of attention shrinks on average. Information that would be salient in a 4K context competes with more noise in a 128K context. This sounds damning, but it is worth noting that softmax’s selectivity is also why transformers work so well in the first place. The challenge is not that softmax is broken; it is that models need to learn sharper discrimination at longer contexts, which is a training problem rather than an architectural dead end.
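The dilution effect is easy to quantify with a toy calculation. Suppose one relevant token's attention logit beats every distractor's by a fixed margin (the 4.0 below is an arbitrary assumption, not a measurement from any real model); its softmax share still collapses as distractors pile up:

```python
import math

def signal_attention_share(n_distractors: int, logit_gap: float = 4.0) -> float:
    """Softmax weight on a single relevant token whose attention logit
    exceeds every distractor's by `logit_gap` (distractors pinned at 0)."""
    return math.exp(logit_gap) / (math.exp(logit_gap) + n_distractors)

for n in (4_000, 32_000, 128_000):
    print(f"{n:>7} distractor tokens -> signal weight {signal_attention_share(n):.5f}")
```

The fix, as the paragraph notes, is for models to learn a larger effective logit gap at long range, not to abandon softmax.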
Second, training distributions are skewed. Documents front-load important information because that is how humans write. Models learn this statistical regularity and develop the characteristic U-shaped attention pattern: they attend to the beginning and end while largely ignoring the middle. This is entirely fixable with better training data, and we have strong evidence it works.
Third, position encodings get fuzzy at extreme lengths. RoPE (Rotary Position Embedding) and its variants were designed for shorter sequences and are extended through interpolation tricks. The model’s sense of where tokens are becomes increasingly approximate. This is an active area of research with several promising approaches, including training natively on longer sequences rather than interpolating.
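To make the interpolation trick concrete, here is a minimal sketch of linear position interpolation on RoPE's rotation angles (head dimension, base, and lengths are illustrative, not any particular model's): positions beyond the trained range are rescaled so they land back inside it, at the cost of squeezing nearby positions closer together.

```python
import numpy as np

def rope_angles(pos: int, dim: int = 64, base: float = 10000.0,
                scale: float = 1.0) -> np.ndarray:
    """Rotation angles RoPE applies to each dimension pair at position `pos`.
    `scale` < 1 implements linear position interpolation."""
    i = np.arange(dim // 2)
    freqs = base ** (-2 * i / dim)   # per-pair rotation frequencies
    return (pos * scale) * freqs

# Extending a model trained on 8K tokens to 32K by interpolation:
train_len, target_len = 8192, 32768
scale = train_len / target_len           # 0.25: every position squeezed 4x
native = rope_angles(1000)               # angles the model saw in training
interp = rope_angles(4000, scale=scale)  # position 4000 now looks like 1000
assert np.allclose(native, interp)
```

The assert shows why this works at all, and also why it blurs: four distinct positions now map into the angular range one position used to occupy.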
The Case for Optimism
Let us be honest about how far we have come. Three years ago, 2K tokens was standard. Today we are arguing about whether 200K or 400K is reliably usable. That is a 100-200x improvement in a remarkably short time. The trend line is not slowing down.
Princeton’s ProLong project demonstrates that the fixes are tractable [4]. Their 8B model matches or beats Llama-3.1-8B on most long-context benchmarks while using only 5% as many long-context training tokens. The key insight is that training data composition matters more than brute-force scaling. Code repositories and books, which have genuinely long-range dependencies unlike typical web text, are excellent training sources. Mixing them with high-quality short-context data prevents capability regression. Training on sequences longer than your target evaluation length provides useful headroom.
The U-shaped attention problem is particularly tractable because it is fundamentally a data distribution issue. Prose front-loads information, but code does not. A function call might reference a definition 10,000 lines away. Dependencies scatter throughout files. You cannot skim the first paragraph of a codebase and understand what it does. As labs incorporate more code and code-like synthetic data into training mixes, the models learn that important information can appear anywhere. The ProLong results suggest this is already working.
The architectural picture is also more encouraging than the “attention dilution” framing might suggest. Yes, softmax creates competition for attention, but this competition is what makes transformers powerful. The question is whether models can learn to allocate attention effectively across long sequences, and the answer appears to be yes, with the right training. Sparse attention variants, hybrid architectures incorporating state-space models, and improved position encodings are all showing promise. We do not need a fundamental breakthrough; we need continued refinement of approaches that are already working.
What the Labs Are Shipping
The major labs are clearly taking this seriously. OpenAI’s GPT-5.2, released in December 2025, explicitly emphasises improved long-context understanding and introduces context compaction for extended workflows [5]. Their benchmarks show GPT-5.2 outperforming previous models on OpenAI-MRCR by a margin that grows substantially at longer input lengths, which is the opposite of the usual pattern where advantages shrink as context grows. Anthropic’s Claude Opus 4.5 takes a complementary approach with “Infinite Chat,” using memory tools and context compaction to maintain coherence across sessions that would otherwise exceed the 200K base window [6]. Claude Sonnet 4.5 now offers a 1M token context window in beta.
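Neither lab documents its exact mechanism, but the core idea of context compaction is simple to sketch: when the transcript exceeds a budget, fold the oldest turns into a model-written summary. The version below is a naive illustration, not either vendor's implementation; character counts stand in for tokens, and `stub_summarize` stands in for a real LLM call.

```python
def compact(history: list[str], budget: int, summarize) -> list[str]:
    """Repeatedly replace the oldest half of the transcript with a single
    summary (via `summarize`) until the whole thing fits the budget."""
    while sum(len(t) for t in history) > budget and len(history) > 2:
        half = len(history) // 2
        history = [summarize(history[:half])] + history[half:]
    return history

def stub_summarize(turns: list[str]) -> str:
    # Stand-in for a model call that would actually condense the turns.
    return f"[summary of {len(turns)} turns]"

turns = [f"turn {i}: " + "x" * 200 for i in range(10)]
compacted = compact(turns, budget=800, summarize=stub_summarize)
```

Recent turns survive verbatim while older ones degrade gracefully into summaries, which matches the intuition behind keeping the "hot" end of the window fresh.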
Reasoning models are a particularly exciting development for long-context reliability. Extended thinking allows models to effectively expand their working memory by writing intermediate results into their chain of thought. This sidesteps attention dilution because the model can explicitly surface and re-attend to information it would otherwise lose. When a reasoning model works through a long document, it can extract and restate key facts as it goes, keeping them in the “hot” part of the attention window. Early results suggest this substantially improves long-context performance, and reasoning capabilities are improving rapidly.
Where This Is Heading
My prediction is that usable context will roughly double relative to advertised context over the next 18 months. If today you can reliably use 30-50% of a 200K window, by mid-2027 you should be able to reliably use 60-80% of a 500K+ window. This would come from a combination of better training data (more code, better synthetic long-context examples), architectural refinements (improved position encodings, selective attention mechanisms), and inference-time techniques (reasoning chains, context compaction, retrieval augmentation).
On the architectural front, sparse attention has quietly crossed an important threshold. DeepSeek’s V3.2 introduced what they call “DeepSeek Sparse Attention” in September 2025, claiming output quality nearly identical to their dense baseline while dramatically reducing computational costs for long sequences [7]. Recent academic work on “Native Sparse Attention” shows models achieving superior average performance on standard benchmarks despite high sparsity, and outperforming dense baselines on LongBench [8]. These are not minor approximations anymore; they are matching or beating full attention on real tasks. If these results hold up and get integrated into frontier models, the usable context ceiling could jump substantially in the next generation.
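The flavor of these schemes can be conveyed with a much simpler relative: for each query, attend only to the top-k highest-scoring keys. To be clear, this is a toy stand-in, not DeepSeek's actual block-selection design or the NSA architecture; it just shows why sparsity can be nearly free when attention is already concentrated.

```python
import numpy as np

def topk_sparse_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray,
                          k: int = 8) -> np.ndarray:
    """Single-query attention restricted to the k highest-scoring keys."""
    scores = K @ q / np.sqrt(q.shape[-1])
    keep = np.argpartition(scores, -k)[-k:]  # indices of the top-k keys
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()                             # softmax over survivors only
    return w @ V[keep]

rng = np.random.default_rng(0)
n, d = 1024, 32
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
q = K[137] * 10.0   # query strongly aligned with key 137
out = topk_sparse_attention(q, K, V, k=8)
```

When attention is peaked, as here, the 1016 dropped keys carried almost no weight anyway, so the sparse result is nearly identical to dense attention at a fraction of the cost; the hard research problem the cited work addresses is choosing `keep` cheaply without scoring every key first.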
The million-token context window that actually works is probably 2-3 years away rather than 5-10. The architectural challenges are real but not insurmountable, the training solutions are understood if not yet fully deployed, and the economic incentives are enormous. Every major lab is pouring resources into this problem because reliable long context unlocks genuinely new applications: agents that can work on multi-day projects, analysis of entire codebases or legal corpora, research assistants that maintain context across hundreds of papers.
The Practical Upshot
Today, usable context is roughly 30-50% of advertised context for retrieval tasks, and less for anything requiring complex reasoning. Context engineering remains important: put critical information at the start and end, use RAG for very large corpora, and consider chunking with summarisation for documents that exceed reliable limits.
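One way to operationalize the start-and-end advice is to fix the prompt layout: instructions up front, the question restated at the end, and relevance-ranked documents filling the middle up to a budget. A minimal sketch, with character counts as a crude token proxy and all names hypothetical:

```python
def assemble_prompt(instructions: str, docs: list[str], question: str,
                    max_chars: int = 40_000) -> str:
    """Place high-salience text where models attend best: instructions
    first, the question restated last, ranked documents in between."""
    body, used = [], len(instructions) + len(question)
    for d in docs:              # assume docs are pre-ranked by relevance
        if used + len(d) > max_chars:
            break               # drop the least relevant docs, not the question
        body.append(d)
        used += len(d)
    return "\n\n".join([instructions, *body, f"Question: {question}"])

instructions = "You are a careful analyst. Cite the document you used."
docs = ["Q3 revenue grew 12% year over year.", "Headcount was flat."]
prompt = assemble_prompt(instructions, docs, "What changed in Q3?")
```

Restating the question at the end matters more than it looks: it is the cheapest defence against the U-shaped attention pattern described earlier.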
These techniques will remain useful even as models improve, but they will become less mandatory and more optional. The gap between advertised and usable context is real, but it is closing. The trajectory is clearly positive, and the research community has a solid understanding of what needs to happen next.
We are not waiting for a miracle. We are waiting for engineering, and engineering tends to get done.
[1] Hong et al. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Research.
[2] Liu et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL.
[3] Yen et al. (2025). HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly. ICLR.
[4] Gao et al. (2025). How to Train Long-Context Language Models (Effectively). ACL.
[5] OpenAI (2025). Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/
[6] Anthropic (2025). Introducing Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5
[7] DeepSeek AI (2025). DeepSeek-V3.2-Exp: Fine-grained Sparse Attention.
[8] Yuan et al. (2025). Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. ACL.

