Imbue Triples Open-Weight LLM Performance on ARC-AGI-2 Benchmark Using Code Evolution
Key Takeaways
- Imbue's code evolution method yields 1.8-2.8x performance gains on ARC-AGI-2 across multiple LLMs, with Kimi K2.5 reaching 34% (a new open-weight record)
- The approach is model-agnostic and cost-efficient, reaching 95.1% accuracy with Gemini 3.1 Pro at $8.71 per task
- Code evolution combines fitness-based sampling with iterative code mutation to improve reasoning without fine-tuning the model
Summary
AI research company Imbue has announced a breakthrough in reasoning capabilities through a novel "code evolution" technique that dramatically improves LLM performance on the challenging ARC-AGI-2 benchmark. The method achieved a 2.8x improvement for Kimi K2.5 (an open-weight model), raising its score from 12.1% to 34% and establishing a new record among open-weight solutions. The approach also boosted Gemini 3 Flash performance 1.8x (from 34% to 61.4%) and pushed Gemini 3.1 Pro to 95.1% accuracy, approaching state-of-the-art results.
The code evolution method works by combining fitness-based sampling with code mutation to iteratively improve solutions, driven by an underlying base LLM but agnostic to the specific model used. The ARC-AGI benchmark, proposed by François Chollet in 2019, tests what he calls "general fluid intelligence" — the ability to efficiently learn solutions to novel problems through visual pattern recognition and geometric reasoning. Each task involves understanding transformation rules from a small set of examples and applying them to new inputs.
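The loop described above (sample candidates by fitness, mutate them, repeat) can be sketched in miniature. This is an entirely illustrative toy, not Imbue's implementation, which is not public: candidates here are a single integer scale parameter rather than LLM-generated program source, and "mutation" enumerates small perturbations instead of asking a base model to rewrite code.

```python
import random

# Toy ARC-style task: each training pair is (input grid, output grid),
# and the hidden transformation rule is "multiply every cell by 3".
TRAIN = [([1, 2], [3, 6]), ([0, 4], [0, 12])]

def run(c, grid):
    """Apply a candidate 'program' (here just a scale factor) to a grid."""
    return [c * x for x in grid]

def fitness(c):
    """Smooth score in (0, 1]: 1 / (1 + total absolute error on TRAIN)."""
    err = sum(abs(a - b) for inp, out in TRAIN
              for a, b in zip(run(c, inp), out))
    return 1.0 / (1.0 + err)

def evolve(pop_size=6, generations=12, seed=0):
    rng = random.Random(seed)
    pop = [rng.randint(-5, 5) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        if fitness(best) == 1.0:  # all training examples solved exactly
            break
        # Fitness-proportional sampling of parents; the current best is
        # always kept as a parent (elitism) so progress is never lost.
        weights = [fitness(c) for c in pop]
        parents = [best] + rng.choices(pop, weights=weights, k=pop_size - 1)
        # Mutation: each parent spawns its +/-1 neighbours, a stand-in
        # for an LLM rewriting a candidate program's source code.
        pool = {q for p in parents for q in (p - 1, p, p + 1)}
        pop = sorted(pool, key=fitness, reverse=True)[:pop_size]
        best = pop[0]
    return best

print(evolve())  # prints 3, the true scale factor
```

Because the elite candidate always survives and its neighbourhood is always explored, the best score improves monotonically until the training pairs are solved; the real method would instead rely on the base LLM to propose semantically meaningful rewrites of candidate code.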
Imbue's results are particularly notable for cost efficiency. The Kimi K2.5 solution achieved 34% accuracy at just $2.67 per task, while the Gemini 3.1 Pro solution reached 95.1% for $8.71 per task — significantly cheaper than competing approaches like Gemini 3 Deep Think ($13.62 per task) or other refinement methods exceeding $30 per task. The company emphasized that the same evolution framework and prompts worked across all three models tested, suggesting broad applicability to reasoning and optimization tasks beyond ARC-AGI-2.
Notably, open-weight model performance now exceeds GPT-5.2 at medium reasoning effort, democratizing access to advanced reasoning capabilities.
Editorial Opinion
This research represents a significant democratization of advanced reasoning capabilities, showing that algorithmic improvements can close the gap between open and closed models. The consistent gains across different model tiers suggest code evolution addresses fundamental limitations in how LLMs approach multi-step reasoning problems. Most importantly, the cost-effectiveness of this approach — achieving near-SOTA results at a fraction of the price — could accelerate adoption of reasoning-intensive AI applications across industries where compute budgets have been prohibitive.