New Benchmark Tests 51 AI Models on 62,000 Logic Puzzles — Best Model Solves Only 56%
Key Takeaways
- OpenAI's GPT-5.2@xhigh leads with a 56% solve rate, but roughly half of all puzzles remain unsolved by any tested model
- Agentic approaches with verifier feedback dramatically outperform single-shot attempts, though solutions take an average of 29 turns and can run up to 14 hours
- Extended reasoning modes (@high, @xhigh) deliver substantial capability gains, with US closed models vastly outperforming Chinese open-source alternatives
Summary
Researchers have released Pencil Puzzle Bench, a comprehensive benchmark featuring 62,000 pencil puzzles across 94 types (including sudoku, nonograms, and slitherlink) designed to test AI models' multi-step reasoning capabilities. The benchmark, developed by Approximate Labs and detailed in a paper published on arXiv, enables intermediate verification at every step, allowing researchers to track exactly where and how models fail.
In testing 51 frontier models across 300 puzzles, OpenAI's GPT-5.2 at "xhigh" reasoning depth emerged as the leader, solving 56% of puzzles in an agentic mode with verifier feedback. However, approximately half of all puzzles remained unsolved by any tested model. The agentic approach required an average of 29 turns per solution, with the longest attempt consuming roughly 1,200 turns over 14 hours. Direct single-shot attempts performed significantly worse, with the best model achieving only 27% accuracy.
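The agentic mode described above amounts to a propose-verify-retry loop: the model submits an attempt, a verifier checks it and reports what is wrong, and the feedback is appended to the conversation for the next turn. The following is a minimal sketch of that pattern; the `model` and `verifier` callables and the turn budget are stand-ins for illustration, not the benchmark's actual harness or API.

```python
def agentic_solve(puzzle, model, verifier, max_turns=1200):
    """Iterate model proposals, feeding verifier feedback back each turn.

    `model` maps a prompt string to a proposed solution string.
    `verifier` returns (solved, feedback) for a proposal.
    Both are hypothetical placeholders for this sketch.
    """
    history = [f"Solve this puzzle:\n{puzzle}"]
    for turn in range(1, max_turns + 1):
        proposal = model("\n".join(history))       # model's next attempt
        solved, feedback = verifier(puzzle, proposal)
        if solved:
            return proposal, turn                  # success after `turn` turns
        # Unsolved: record the failed attempt and the verifier's critique
        history.append(f"Attempt:\n{proposal}\nVerifier: {feedback}")
    return None, max_turns                         # budget exhausted, unsolved
```

Under this framing, the article's figures correspond to loops that typically terminate after about 29 turns, with the worst case running to roughly 1,200 turns.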
The results revealed stark disparities in reasoning capability. US-based closed models dominated the leaderboard, with three models exceeding 33% solve rates, while the top Chinese open-source model achieved only 6%. Extended reasoning modes (@medium, @high, @xhigh) dramatically improved performance but sometimes caused infrastructure failures. Cost efficiency varied enormously: xAI's Grok 4.1 Fast Reasoning solved puzzles for as little as $0.00033 per success, while Claude Sonnet 4.6 with 1M context cost up to $238.16 per successful solve.
The researchers have made the full dataset, interactive puzzle player, and step-by-step AI solution replays publicly available, providing transparency into how models approach these logic problems. The benchmark represents a significant test of verifiable reasoning capabilities, where correctness can be objectively determined — unlike many subjective AI evaluation tasks.
- Cost per successful solve spans a roughly 720,000x range between the most and least efficient models, from $0.00033 to $238.16
- The 62,000-puzzle benchmark, spanning 94 puzzle types, provides objective, verifiable testing of multi-step reasoning with validation of intermediate steps
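The quoted cost spread is easy to sanity-check from the two per-solve figures reported in the article:

```python
# Figures quoted in the article, in dollars per successful solve
cheapest = 0.00033   # Grok 4.1 Fast Reasoning
priciest = 238.16    # Claude Sonnet 4.6 with 1M context

ratio = priciest / cheapest
print(f"{ratio:,.0f}x")  # on the order of 720,000x, as the article states
```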
Editorial Opinion
This benchmark fills a critical gap in AI evaluation by providing objectively verifiable reasoning tasks where partial progress can be tracked at every step. The fact that half the puzzles remain unsolved by any frontier model, despite agentic iteration over hundreds of turns, suggests these logic puzzles capture reasoning challenges that current architectures struggle with fundamentally. The massive cost variance between models achieving similar results raises important questions about the economics of reasoning-heavy applications, particularly as extended reasoning modes approach infrastructure limits.