minRLM: Token-Efficient Recursive Language Models Achieve 3.6× Better Efficiency While Outperforming Vanilla LLMs
Key Takeaways
- ▸minRLM achieves 3.6× token efficiency gains on GPT-4o mini and 30+ percentage point accuracy improvements over vanilla LLMs on larger models
- ▸By storing data as REPL variables and having models write code to query it, attention only runs on filtered results rather than entire documents, avoiding context window rot
- ▸Costs remain flat regardless of context size, making the approach viable for long-context tasks that would be prohibitively expensive with traditional LLMs
Summary
minRLM, a new token- and latency-efficient implementation of Recursive Language Models (RLMs), demonstrates significant improvements over both vanilla LLM approaches and the reference implementation. The system scores 72.7% on GPT-4o mini (versus 69.7% for the official reference implementation and 69.5% for the vanilla baseline) while using 3.6× fewer tokens, and achieves even larger gains on larger models, winning 11 of 12 benchmark tasks against vanilla implementations. Rather than pasting large documents into the context window, minRLM stores input data as variables in a Python REPL, allowing the model to write code to query and filter the data, with attention running only on the results.
The approach builds on a December 2025 proposal by Zhang, Kraska, and Khattab and extends their validation across 12 tasks and multiple model sizes. A key innovation is that costs remain roughly flat regardless of context size, as large documents (even 7M characters) become as accessible as much smaller ones (7K characters) through code-based navigation rather than wholesale reading. The implementation includes an open-source codebase with every intermediate step in readable, rerunnable Python code, enabling transparency and debugging.
The pattern aligns with production deployments like Anthropic's improved web search and emerging standards like the Model Context Protocol (MCP) for standardizing code execution across AI providers.
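To make the core idea concrete, here is a minimal sketch of the pattern the summary describes: the document lives as a Python variable in a sandboxed namespace, the model emits code that filters it, and only the small filtered result re-enters the model's context. The helper names (`run_repl_step`, `model_written_code`) and the synthetic document are illustrative assumptions, not taken from the minRLM codebase.

```python
# Illustrative sketch of the RLM REPL pattern (not the minRLM implementation).

def run_repl_step(namespace: dict, code: str) -> str:
    """Execute model-written code against the REPL namespace and return
    whatever it assigned to `result`. That string -- not the full document
    stored in the namespace -- is what gets fed back to the model."""
    exec(code, namespace)
    return str(namespace.get("result", ""))

# A large "document" stored as a REPL variable (10,000 synthetic log lines):
namespace = {
    "doc": "\n".join(
        f"line {i}: {'ERROR' if i % 1000 == 0 else 'ok'}"
        for i in range(10_000)
    )
}

# Code the model might write to answer "which lines contain ERROR?":
model_written_code = (
    "result = '\\n'.join(l for l in doc.splitlines() if 'ERROR' in l)"
)

filtered = run_repl_step(namespace, model_written_code)

# Attention now runs over ~10 matching lines instead of the full document,
# which is why cost stays roughly flat as the document grows.
print(len(namespace["doc"].splitlines()), "->", len(filtered.splitlines()))
```

Because the model only ever sees `filtered`, the same loop works whether `doc` is 7K or 7M characters; the document's size affects the sandbox, not the prompt.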
Editorial Opinion
minRLM represents a meaningful shift in how we should think about LLM efficiency: instead of throwing larger context windows and more tokens at retrieval and analytics problems, using the model as a code generator to query data through a Python sandbox is both cheaper and more accurate. The ~30pp accuracy gap on larger models is striking and suggests this approach deserves serious consideration in production systems. As context window rot becomes a recognized limitation of scaling context length, RLM-style patterns offer a practical alternative that's starting to appear in real-world products.