The first time I really felt the context window as a wall, I was piping a few thousand lines of BMC logs into a local model and asking it to find the one node that was about to fail. It could do it — when the logs were short. As the haystack grew, the answers got vaguer, then wrong, and not because the failure signal disappeared. It was still in there. The model just stopped being able to find it. More context made it worse, not better.
That failure mode has a name now — context rot — and this paper is the most convincing attempt I’ve read at routing around it without waiting for someone to ship a bigger window. “Recursive Language Models,” from Alex Zhang, Tim Kraska, and Omar Khattab at MIT CSAIL (December 2025), proposes something refreshingly un-architectural: don’t change the model, change what you do with the prompt.
The idea in one sentence
“[A] general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt.” — from the abstract
That’s the whole move, and it’s worth sitting with. Almost every other answer to “the context is too long” tries to make more of it fit: bigger windows, retrieval that pre-selects what to stuff in, summarization that compresses before the model ever sees the text. All of those decide for the model what survives the bottleneck.
RLMs invert it. The long input doesn’t go into the context window at all — it sits outside, in an environment the model can poke at, the same way an agent pokes at a filesystem. A root LLM gets the prompt not as text to read but as a variable to operate on: slice it, grep it, chunk it, and — crucially — spawn fresh calls to itself on the pieces, then combine the results. The context window stops being the container for your data and becomes the container for a program that reasons about your data.
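To make "the prompt is a variable, not text" concrete, here is a minimal sketch of my own, not the paper's implementation. `PromptEnv`, `stub_llm`, and `root_program` are hypothetical names, and `stub_llm` is a keyword-matching stand-in for a real model call. The hand-written `root_program` plays the part of the code a root model might emit against the environment.

```python
import re
from typing import Callable, List


class PromptEnv:
    """Holds the long input outside the model's context (hypothetical helper)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def peek(self, start: int, end: int) -> str:
        """Return a slice of the input by character offset."""
        return self.text[start:end]

    def grep(self, pattern: str) -> List[str]:
        """Return every line that matches a regular expression."""
        rx = re.compile(pattern)
        return [line for line in self.text.splitlines() if rx.search(line)]

    def chunks(self, size: int) -> List[str]:
        """Split the input into groups of `size` lines."""
        lines = self.text.splitlines()
        return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call: names the node on the first ERR line."""
    for line in prompt.splitlines():
        if "ERR" in line:
            return line.split()[0]
    return "none"


def root_program(env: PromptEnv, llm: Callable[[str], str]) -> List[str]:
    """Assumed shape of a root model's program: filter in code, then sub-call."""
    answers = []
    for chunk in env.chunks(size=100):
        # Cheap check in plain code first; only promising chunks get a sub-call.
        if "ERR" not in chunk:
            continue
        answers.append(llm(chunk))
    return answers


# Synthetic BMC-style log: 1000 healthy lines except one failing node.
logs = "\n".join(
    "node257 ERR fan_rpm=0" if i == 257 else f"node{i:03d} OK temp=40"
    for i in range(1000)
)
env = PromptEnv(logs)
result = root_program(env, stub_llm)
print(result)  # ['node257']
```

The point of the sketch is where the data lives: the 1000 lines never enter a single model call. Only one 100-line chunk does, and the code around it decides which one.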
Why this is more than “chunk it and map-reduce”
My first reaction was skepticism: isn’t this just RAG with extra steps, or a map-reduce loop anyone could write? The distinction the paper makes is that the model is in control of the decomposition, recursively, at inference time — not a fixed pipeline a human wired up in advance.
A retrieval scaffold decides the chunking strategy once, up front. RLMs let the model look at the shape of the actual input and decide how to break it down — and each sub-call can decide to break things down further. It’s recursive, so a chunk that turns out to be dense can be split again; a chunk that’s irrelevant can be dropped after a cheap glance. That adaptivity is the part static scaffolds can’t replicate, and it’s why the authors frame it as inference-time scaling: you spend more compute by recursing deeper, not by cramming more tokens into one forward pass.
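The adaptive part can be sketched as a recursive function. This is a toy under loud assumptions: `looks_relevant` and `answer_leaf` are hypothetical keyword checks standing in for a cheap model glance and a full model call, `MAX_LEAF_LINES` is an arbitrary threshold I picked, and in the paper the model chooses the decomposition rather than following a fixed halving rule like this one.

```python
from typing import List, Tuple

MAX_LEAF_LINES = 50  # assumption: a slice this small is answered well in one call

Trace = List[Tuple[int, int, str]]  # (depth, number of lines, action taken)


def looks_relevant(lines: List[str]) -> bool:
    """Stand-in for a cheap model glance; here just a keyword check."""
    return any("ERR" in line for line in lines)


def answer_leaf(lines: List[str]) -> List[str]:
    """Stand-in for a full model call on a small slice."""
    return [line.split()[0] for line in lines if "ERR" in line]


def decompose(lines: List[str], depth: int, trace: Trace) -> List[str]:
    """Drop irrelevant slices, answer small ones, split and recurse on the rest."""
    if not looks_relevant(lines):
        trace.append((depth, len(lines), "drop"))
        return []
    if len(lines) <= MAX_LEAF_LINES:
        trace.append((depth, len(lines), "leaf"))
        return answer_leaf(lines)
    trace.append((depth, len(lines), "split"))
    mid = len(lines) // 2
    return (decompose(lines[:mid], depth + 1, trace)
            + decompose(lines[mid:], depth + 1, trace))


failing = {257, 900}
logs = [
    f"node{i:03d} ERR fan_rpm=0" if i in failing else f"node{i:03d} OK temp=40"
    for i in range(1000)
]
trace: Trace = []
found = decompose(logs, 0, trace)
leaf_calls = [t for t in trace if t[2] == "leaf"]
print(found, len(leaf_calls), max(d for d, _, _ in trace))
# ['node257', 'node900'] 2 5
```

Only two slices of 31 lines each get the expensive treatment; everything else is discarded after a glance. The depth of the recursion follows where the signal is, which a chunking strategy fixed up front cannot do.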
The results that made me take it seriously
Two orders of magnitude is the headline — RLMs handle inputs up to ~100× beyond the underlying model’s context window. But the number I found more persuasive is what happens on prompts that already fit: even there, RLMs beat the vanilla model. That’s the tell that this is attacking context rot, not just context length. If you only beat the baseline once you exceed the window, you’ve solved a capacity problem. Beating it inside the window means you’ve solved a quality problem — the model reasons better over text it decomposed itself than over the same text dumped in whole.
Across four long-context tasks, the paper reports (all at comparable cost):
- +26% over compaction-based methods on GPT-5
- +130% over CodeAct with sub-calls
- +13% over Claude Code
And the result I keep thinking about as a homelab person: RLM-Qwen3-8B, an 8B open model built around the recursive paradigm, beat the base Qwen3-8B by 28.3% on average and approached vanilla GPT-5 on three long-context benchmarks. An 8B model you can run yourself, closing in on a frontier model. One correction to my first read: per the abstract, RLM-Qwen3-8B is post-trained to be natively recursive, so this particular number is not purely a change to the inference loop. The scaffold-only wins are the frontier-model comparisons above. Either way, that’s the kind of leverage that’s actually available to people who don’t own a datacenter.
What it costs
The honest caveats, because a paper note that only lists wins isn’t a note, it’s a press release.
This is more inference, orchestrated. A recursive decomposition means many model calls instead of one, with the latency and bookkeeping that implies — the paper’s “comparable cost” framing is about quality-per-dollar against other long-context scaffolds, not about being cheap in absolute terms. Recursion also has to terminate sensibly; a decomposition that fans out badly is its own failure mode. And “the model controls the chunking” is a strength and a liability at once — you’re trusting the model’s judgment about what to look at, which is exactly the judgment that degrades when it’s overwhelmed. The bet is that it judges far better over a slice than over the whole, and the numbers say it does, but it’s a bet worth naming.
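One way to keep a bad fan-out from running away is a hard budget on depth and total calls. The sketch below is my own guard, not something the paper specifies: `Budget`, `BudgetExceeded`, and `split_everything` are hypothetical names, and the limits are arbitrary. `split_everything` is a deliberately poor decomposition that keeps halving until it reaches single lines.

```python
from dataclasses import dataclass
from typing import List


class BudgetExceeded(RuntimeError):
    """Raised when a decomposition goes too deep or makes too many calls."""


@dataclass
class Budget:
    max_calls: int
    max_depth: int
    calls: int = 0

    def charge(self, depth: int) -> None:
        """Record one model call, or refuse it if a limit would be broken."""
        if depth > self.max_depth:
            raise BudgetExceeded(f"depth {depth} exceeds limit {self.max_depth}")
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"call limit {self.max_calls} reached")
        self.calls += 1


def split_everything(lines: List[str], depth: int, budget: Budget) -> int:
    """Deliberately poor strategy: halve down to single lines, one call per node."""
    budget.charge(depth)
    if len(lines) <= 1:
        return len(lines)
    mid = len(lines) // 2
    return (split_everything(lines[:mid], depth + 1, budget)
            + split_everything(lines[mid:], depth + 1, budget))


# A small input finishes inside the budget.
small = Budget(max_calls=25, max_depth=6)
small_total = split_everything(["line"] * 8, 0, small)

# A large input trips the depth limit instead of fanning out indefinitely.
big = Budget(max_calls=25, max_depth=6)
stopped = False
reason = ""
try:
    split_everything(["line"] * 1000, 0, big)
except BudgetExceeded as exc:
    stopped = True
    reason = str(exc)

print(small_total, small.calls, stopped, big.calls, reason)
```

A real system would want a fallback when the budget trips, such as answering from whatever partial results exist, but the guard itself is cheap and makes the worst case bounded.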
Why it stuck with me
What I like about this paper is the same thing I liked about Toolformer: it reframes a capability as something external and orchestrated rather than something baked into the weights. Toolformer made the model reach outside itself for tools. RLMs make the model reach outside itself for its own context — and then call itself as one of those tools.
It also rhymes with how I’ve watched good agents behave. When I give a coding agent a giant repo, the version that works doesn’t try to read everything; it lists, greps, opens the three files that matter, and ignores the rest. RLMs feel like that instinct formalized into an inference paradigm and pointed at raw text instead of a codebase. The lesson I’m taking into my own homelab work: when a model is drowning in context, the fix usually isn’t a bigger window — it’s giving the model a way to not look at most of it.