BotBeat

Imbue
RESEARCH | 2026-02-27

Imbue Triples Open-Weight LLM Performance on ARC-AGI-2 Benchmark Using Code Evolution

Key Takeaways

  • Imbue's code evolution method achieves 2-3x performance gains on ARC-AGI-2 across multiple LLMs, with Kimi K2.5 reaching 34% (a new open-weight record)
  • The approach is model-agnostic and cost-efficient, achieving 95.1% accuracy with Gemini 3.1 Pro at $8.71 per task
  • Code evolution combines fitness-based sampling and iterative mutation to enhance reasoning without requiring model fine-tuning
Source: Hacker News (https://imbue.com/research/2026-02-27-arc-agi-2-evolution/)

Summary

AI research company Imbue has announced a breakthrough in reasoning capabilities through a novel "code evolution" technique that dramatically improves LLM performance on the challenging ARC-AGI-2 benchmark. The method achieved a 2.8x improvement for Kimi K2.5 (an open-weight model), raising its score from 12.1% to 34% — establishing a new record for open-source solutions. The approach also boosted Gemini 3 Flash performance 1.8x (from 34% to 61.4%) and pushed Gemini 3.1 Pro to 95.1% accuracy, approaching state-of-the-art results.

The code evolution method works by combining fitness-based sampling with code mutation to iteratively improve solutions, driven by an underlying base LLM but agnostic to the specific model used. The ARC-AGI benchmark, proposed by François Chollet in 2019, tests what he calls "general fluid intelligence" — the ability to efficiently learn solutions to novel problems through visual pattern recognition and geometric reasoning. Each task involves understanding transformation rules from a small set of examples and applying them to new inputs.
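The article does not publish Imbue's implementation, but the loop it describes, sample candidates in proportion to fitness, mutate them, and repeat, can be illustrated with a minimal, self-contained sketch. Everything here is an assumption for illustration: a toy target-string problem stands in for an ARC task, and a random character rewrite stands in for the LLM-driven code mutation the article describes.

```python
import random

random.seed(0)

# Toy stand-ins (hypothetical, not from the article): the hidden "rule"
# is a target string, and fitness counts matching positions, analogous
# to how many training examples a candidate program solves.
TARGET = "ARC"
ALPHABET = "ABCR"

def fitness(candidate: str) -> int:
    """Score a candidate by positions that match the target."""
    return sum(c == t for c, t in zip(candidate, TARGET))

def mutate(candidate: str) -> str:
    """Rewrite one random position; in Imbue's setting an LLM would
    rewrite part of a candidate program instead."""
    i = random.randrange(len(candidate))
    return candidate[:i] + random.choice(ALPHABET) + candidate[i + 1:]

def evolve(pop_size: int = 8, generations: int = 50) -> str:
    """Fitness-based sampling plus iterative mutation."""
    population = [
        "".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Fitter candidates are proportionally more likely to be parents
        # (+1 keeps zero-fitness candidates sampleable).
        weights = [fitness(c) + 1 for c in population]
        parents = random.choices(population, weights=weights, k=pop_size)
        population = [mutate(p) for p in parents]
        best = max(population, key=fitness)
        if fitness(best) == len(TARGET):
            return best
    return max(population, key=fitness)
```

The sketch is model-agnostic in the same sense the article emphasizes: swapping the mutation operator (here a random edit, there an LLM prompt) changes nothing about the surrounding selection loop.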

Imbue's results are particularly notable for cost efficiency. The Kimi K2.5 solution achieved 34% accuracy at just $2.67 per task, while the Gemini 3.1 Pro solution reached 95.1% for $8.71 per task — significantly cheaper than competing approaches like Gemini 3 Deep Think ($13.62 per task) or other refinement methods exceeding $30 per task. The company emphasized that the same evolution framework and prompts worked across all three models tested, suggesting broad applicability to reasoning and optimization tasks beyond ARC-AGI-2.

  • Open-weight model performance now exceeds GPT-5.2 at medium reasoning effort, democratizing access to advanced reasoning capabilities

Editorial Opinion

This research represents a significant democratization of advanced reasoning capabilities, showing that algorithmic improvements can close the gap between open and closed models. The consistent gains across different model tiers suggest code evolution addresses fundamental limitations in how LLMs approach multi-step reasoning problems. Most importantly, the cost-effectiveness of this approach — achieving near-SOTA results at a fraction of the price — could accelerate adoption of reasoning-intensive AI applications across industries where compute budgets have been prohibitive.

Large Language Models (LLMs) · Reinforcement Learning · Machine Learning · Science & Research · Open Source

© 2026 BotBeat