Pure LLMs Score 0% on ARC-AGI-2 Benchmark, Raising Questions About Path to AGI
Key Takeaways
- Pure LLMs score 0% on the ARC-AGI-2 benchmark, which tests abstract reasoning and general intelligence rather than pattern matching
- The results draw parallels between modern LLM limitations and the failures of first-wave symbolic AI, suggesting current scaling approaches may not lead to AGI
- The findings indicate that achieving artificial general intelligence may require hybrid systems or fundamentally different architectures beyond pure transformer models
Summary
A provocative new analysis reveals that pure large language models achieve a 0% success rate on the ARC-AGI-2 benchmark, a test designed to measure abstract reasoning and general intelligence capabilities. The finding, reported by Aedelon, suggests that despite massive scaling efforts and improvements in specific tasks, current LLM architectures may lack fundamental capabilities required for artificial general intelligence. The ARC-AGI benchmark, created by François Chollet, specifically tests for fluid intelligence and the ability to solve novel problems without prior training, distinguishing it from knowledge-based or pattern-matching tasks where LLMs excel.
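To make the benchmark's premise concrete: ARC tasks are small grid-transformation puzzles, distributed as JSON with a few demonstration input/output pairs from which a solver must infer a novel rule. The sketch below uses an illustrative made-up task (not an actual benchmark item) and a hypothetical inferred rule, just to show the format and the induce-then-verify loop:

```python
# Illustrative ARC-style task (invented example, not from the benchmark).
# Each task provides a few input->output grid pairs; the solver must infer
# the transformation and apply it to a held-out test input.
# Grids are lists of lists of small integers representing colors (0-9).

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[3, 0], [0, 0]]}],
}

def mirror_horizontal(grid):
    """Hypothetical rule inferred from the training pairs: flip each row."""
    return [list(reversed(row)) for row in grid]

# Verify the candidate rule reproduces every demonstration pair
# before trusting it on the test input.
assert all(
    mirror_horizontal(pair["input"]) == pair["output"]
    for pair in task["train"]
)

prediction = mirror_horizontal(task["test"][0]["input"])
print(prediction)  # [[0, 3], [0, 0]]
```

The difficulty Chollet's benchmark targets is that each task uses a different, previously unseen rule, so memorized patterns from training data do not transfer.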
The article draws a striking parallel between today's "third wave" of AI and the "first wave" of symbolic AI systems from the 1950s-1980s, arguing that both approaches achieve impressive performance on narrow tasks while failing at general reasoning. This comparison challenges the prevailing narrative that simply scaling up transformer-based models will inevitably lead to AGI. The zero-percent score highlights a potential ceiling in pure LLM capabilities when confronted with tasks requiring true abstraction and reasoning rather than pattern recognition from training data.
The findings have significant implications for AI research directions and investment strategies. While hybrid approaches combining LLMs with other techniques have shown better results on ARC-AGI, the zero-percent performance of pure LLMs suggests that alternative architectures or fundamentally different approaches may be necessary to achieve human-like general intelligence. This could redirect research efforts toward neuro-symbolic systems, more structured reasoning mechanisms, or entirely novel paradigms beyond the current transformer-dominated landscape.
Editorial Opinion
This benchmark result serves as a crucial reality check for the AI industry's AGI ambitions. While the zero-percent score may seem alarming, it actually provides valuable clarity about what LLMs can and cannot do, helping separate genuine progress toward general intelligence from impressive but narrow capabilities. The comparison to first-wave AI is particularly illuminating—it suggests we may be optimizing for the wrong metrics and that true AGI might require acknowledging LLMs' limitations rather than simply scaling them further. This should energize research into hybrid approaches and alternative architectures rather than discourage AI development.



