Imbue Triples Open-Weight LLM Performance on ARC-AGI-2 Benchmark Using Code Evolution
Key Takeaways
- Imbue's code evolution method yields 1.8-2.8x performance gains on ARC-AGI-2 across multiple LLMs, with Kimi K2.5 reaching 34% (a new open-weight record)
- The approach is model-agnostic and cost-efficient, reaching 95.1% accuracy with Gemini 3.1 Pro at $8.71 per task
- Code evolution combines fitness-based sampling with iterative code mutation to improve reasoning without fine-tuning the model
Summary
AI research company Imbue has announced a breakthrough in reasoning capabilities through a novel "code evolution" technique that dramatically improves LLM performance on the challenging ARC-AGI-2 benchmark. The method achieved a 2.8x improvement for Kimi K2.5 (an open-weight model), raising its score from 12.1% to 34% and establishing a new record among open-weight solutions. The approach also boosted Gemini 3 Flash performance 1.8x (from 34% to 61.4%) and pushed Gemini 3.1 Pro to 95.1% accuracy, approaching state-of-the-art results.
The code evolution method works by combining fitness-based sampling with code mutation to iteratively improve solutions, driven by an underlying base LLM but agnostic to the specific model used. The ARC-AGI benchmark, proposed by François Chollet in 2019, tests what he calls "general fluid intelligence" — the ability to efficiently learn solutions to novel problems through visual pattern recognition and geometric reasoning. Each task involves understanding transformation rules from a small set of examples and applying them to new inputs.
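The loop described above (sample candidates by fitness, mutate them, repeat) can be sketched in miniature. This is an entirely illustrative toy, not Imbue's implementation, which is not public: candidates here are a single integer scale parameter rather than LLM-generated program source, and "mutation" enumerates small perturbations instead of asking a base model to rewrite code.

```python
import random

# Toy ARC-style task: each training pair is (input grid, output grid),
# and the hidden transformation rule is "multiply every cell by 3".
TRAIN = [([1, 2], [3, 6]), ([0, 4], [0, 12])]

def run(c, grid):
    """Apply a candidate 'program' (here just a scale factor) to a grid."""
    return [c * x for x in grid]

def fitness(c):
    """Smooth score in (0, 1]: 1 / (1 + total absolute error on TRAIN)."""
    err = sum(abs(a - b) for inp, out in TRAIN
              for a, b in zip(run(c, inp), out))
    return 1.0 / (1.0 + err)

def evolve(pop_size=6, generations=12, seed=0):
    rng = random.Random(seed)
    pop = [rng.randint(-5, 5) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        if fitness(best) == 1.0:  # all training examples solved exactly
            break
        # Fitness-proportional sampling of parents; the current best is
        # always kept as a parent (elitism) so progress is never lost.
        weights = [fitness(c) for c in pop]
        parents = [best] + rng.choices(pop, weights=weights, k=pop_size - 1)
        # Mutation: each parent spawns its +/-1 neighbours, a stand-in
        # for an LLM rewriting a candidate program's source code.
        pool = {q for p in parents for q in (p - 1, p, p + 1)}
        pop = sorted(pool, key=fitness, reverse=True)[:pop_size]
        best = pop[0]
    return best

print(evolve())  # prints 3, the true scale factor
```

Because the elite candidate always survives and its neighbourhood is always explored, the best score improves monotonically until the training pairs are solved; the real method would instead rely on the base LLM to propose semantically meaningful rewrites of candidate code.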
Imbue's results are particularly notable for cost efficiency. The Kimi K2.5 solution achieved 34% accuracy at just $2.67 per task, while the Gemini 3.1 Pro solution reached 95.1% for $8.71 per task — significantly cheaper than competing approaches like Gemini 3 Deep Think ($13.62 per task) or other refinement methods exceeding $30 per task. The company emphasized that the same evolution framework and prompts worked across all three models tested, suggesting broad applicability to reasoning and optimization tasks beyond ARC-AGI-2.
Notably, open-weight model performance now exceeds GPT-5.2 at medium reasoning effort, democratizing access to advanced reasoning capabilities.
Editorial Opinion
This research represents a significant democratization of advanced reasoning capabilities, showing that algorithmic improvements can close the gap between open and closed models. The consistent gains across different model tiers suggest code evolution addresses fundamental limitations in how LLMs approach multi-step reasoning problems. Most importantly, the cost-effectiveness of this approach — achieving near-SOTA results at a fraction of the price — could accelerate adoption of reasoning-intensive AI applications across industries where compute budgets have been prohibitive.