BotBeat

RESEARCH · OpenAI · 2026-03-03

New Benchmark Tests 51 AI Models on 62,000 Logic Puzzles — Best Model Solves Only 56%

Key Takeaways

  • OpenAI's GPT-5.2@xhigh leads with a 56% solve rate, but roughly half of all puzzles remain unsolved by any tested model
  • Agentic approaches with verifier feedback dramatically outperform single-shot attempts, though solutions require an average of 29 turns and can take up to 14 hours
  • Extended reasoning modes (@high, @xhigh) provide substantial capability improvements, with US closed models vastly outperforming Chinese open-source alternatives
Source: Hacker News (https://ppbench.com/)

Summary

Researchers have released Pencil Puzzle Bench, a comprehensive benchmark featuring 62,000 pencil puzzles across 94 types (including sudoku, nonograms, and slitherlink) designed to test AI models' multi-step reasoning capabilities. The benchmark, developed by Approximate Labs and detailed in a paper published on arXiv, enables intermediate verification at every step, allowing researchers to track exactly where and how models fail.

In testing 51 frontier models across 300 puzzles, OpenAI's GPT-5.2 at "xhigh" reasoning depth emerged as the leader, solving 56% of puzzles in an agentic mode with verifier feedback. However, approximately half of all puzzles remained unsolved by any tested model. The agentic approach required an average of 29 turns per solution, with the longest attempt consuming roughly 1,200 turns over 14 hours. Direct single-shot attempts performed significantly worse, with the best model achieving only 27% accuracy.
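The agentic mode described above can be pictured as a propose-verify loop: the model suggests a step, the benchmark's intermediate verifier accepts it or returns feedback, and the loop continues until the puzzle is solved or the turn budget runs out. The sketch below is a minimal illustration of that pattern with a toy digit-guessing puzzle; `PuzzleState`, `agentic_solve`, and `make_toy` are all hypothetical names, not from the paper or the benchmark.

```python
from dataclasses import dataclass, field

@dataclass
class PuzzleState:
    """Toy puzzle: fill in a hidden digit sequence one step at a time."""
    target: list
    filled: list = field(default_factory=list)

    def apply(self, step):
        return PuzzleState(self.target, self.filled + [step])

    def is_solved(self):
        return self.filled == self.target

def agentic_solve(state, propose_step, verify_step, max_turns=1200):
    """Propose-verify loop: each proposal costs one turn, accepted or not."""
    feedback = None
    for turn in range(1, max_turns + 1):
        step = propose_step(state, feedback)      # "model" suggests a move
        ok, feedback = verify_step(state, step)   # verifier checks the step
        if ok:
            state = state.apply(step)
            if state.is_solved():
                return state, turn                # solved after `turn` turns
    return None, max_turns                        # unsolved within the budget

def make_toy(target):
    """Stand-in 'model' that retries digits, guided by verifier feedback."""
    def propose(state, feedback):
        # Start at 0; after a rejection, try the next digit up.
        return 0 if feedback is None else feedback + 1
    def verify(state, step):
        if step == state.target[len(state.filled)]:
            return True, None      # accepted: clear the feedback
        return False, step         # rejected: report the failed digit
    return propose, verify
```

Counting every proposal as a turn (accepted or rejected) mirrors how the benchmark's turn counts can balloon when the verifier repeatedly rejects a model's steps.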

The results revealed stark disparities in reasoning capability. US-based closed models dominated the leaderboard, with three models exceeding 33% solve rates, while the top Chinese open-source model achieved only 6%. Extended reasoning modes (@medium, @high, @xhigh) dramatically improved performance but sometimes caused infrastructure failures. Cost efficiency varied enormously: xAI's Grok 4.1 Fast Reasoning solved puzzles for as little as $0.00033 per success, while Claude Sonnet 4.6 with 1M context cost up to $238.16 per successful solve.

The researchers have made the full dataset, interactive puzzle player, and step-by-step AI solution replays publicly available, providing transparency into how models approach these logic problems. The benchmark represents a significant test of verifiable reasoning capabilities, where correctness can be objectively determined — unlike many subjective AI evaluation tasks.

  • Cost per success varies by 720,000x between most and least efficient models, from $0.00033 to $238.16
  • The 62,000-puzzle benchmark with 94 puzzle types provides objective, verifiable testing of multi-step reasoning with intermediate step validation
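The 720,000x figure follows directly from the two per-success costs quoted above; a quick check:

```python
# Per-success costs reported in the article, in dollars.
cheapest = 0.00033   # Grok 4.1 Fast Reasoning
priciest = 238.16    # Claude Sonnet 4.6 with 1M context

ratio = priciest / cheapest
print(f"{ratio:,.0f}x")  # 721,697x, i.e. roughly 720,000x
```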

Editorial Opinion

This benchmark fills a critical gap in AI evaluation by providing objectively verifiable reasoning tasks where partial progress can be tracked at every step. The fact that half the puzzles remain unsolved by any frontier model, despite agentic iteration over hundreds of turns, suggests these logic puzzles capture reasoning challenges that current architectures struggle with fundamentally. The massive cost variance between models achieving similar results raises important questions about the economics of reasoning-heavy applications, particularly as extended reasoning modes approach infrastructure limits.

Large Language Models (LLMs) · AI Agents · Machine Learning · Science & Research · Open Source


© 2026 BotBeat