BotBeat

RESEARCH · OpenAI · 2026-03-03

New Benchmark With 62,000 Logic Puzzles Tests 51 AI Models: Best Model Solves Only 56%

Key Takeaways

  • OpenAI's GPT-5.2@xhigh leads with a 56% solve rate, but roughly half of all puzzles remain unsolved by any tested model
  • Agentic approaches with verifier feedback far outperform single-shot attempts (56% versus 27% for the best model in each mode), though solutions take an average of 29 turns and the longest attempt ran for roughly 14 hours
  • Extended reasoning modes (@high, @xhigh) substantially improve solve rates; separately, US closed models far outperform Chinese open-source alternatives
Source: Hacker News, https://ppbench.com/

Summary

Researchers have released Pencil Puzzle Bench, a comprehensive benchmark featuring 62,000 pencil puzzles across 94 types (including sudoku, nonograms, and slitherlink) designed to test AI models' multi-step reasoning capabilities. The benchmark, developed by Approximate Labs and detailed in a paper published on arXiv, enables intermediate verification at every step, allowing researchers to track exactly where and how models fail.

In testing 51 frontier models across 300 puzzles, OpenAI's GPT-5.2 at "xhigh" reasoning depth emerged as the leader, solving 56% of puzzles in an agentic mode with verifier feedback. However, approximately half of all puzzles remained unsolved by any tested model. The agentic approach required an average of 29 turns per solution, with the longest attempt consuming roughly 1,200 turns over 14 hours. Direct single-shot attempts performed significantly worse, with the best model achieving only 27% accuracy.
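The agentic mode described above, in which a model proposes a step, a verifier checks it, and the feedback goes back to the model, can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the toy puzzle (a 4x4 Latin square), the function names, and the `naive_proposer` stand-in for a model are all assumptions made for illustration.

```python
from typing import Callable, Optional

Grid = list[list[int]]             # 0 marks an empty cell
Move = tuple[int, int, int]        # (row, col, value)
Feedback = list[tuple[Move, str]]  # rejected moves with the verifier's reason


def verify_move(grid: Grid, move: Move) -> Optional[str]:
    """Return None if the move is legal, otherwise a feedback message."""
    r, c, v = move
    n = len(grid)
    if not (0 <= r < n and 0 <= c < n and 1 <= v <= n):
        return "move is out of range"
    if grid[r][c] != 0:
        return f"cell ({r}, {c}) is already filled"
    if v in grid[r]:
        return f"{v} already appears in row {r}"
    if any(grid[i][c] == v for i in range(n)):
        return f"{v} already appears in column {c}"
    return None


def naive_proposer(grid: Grid, feedback: Feedback) -> Optional[Move]:
    """Hypothetical stand-in for a model: fill the first empty cell,
    skipping values the verifier has already rejected for that cell."""
    n = len(grid)
    for r in range(n):
        for c in range(n):
            if grid[r][c] == 0:
                rejected = {m[2] for m, _ in feedback if m[:2] == (r, c)}
                for v in range(1, n + 1):
                    if v not in rejected:
                        return (r, c, v)
                return None  # every value was rejected: give up
    return None  # grid is already full


def solve_agentically(
    grid: Grid,
    propose: Callable[[Grid, Feedback], Optional[Move]],
    max_turns: int,
) -> tuple[bool, int, Grid]:
    """Run the propose/verify loop; return (solved, turns_used, final_grid)."""
    grid = [row[:] for row in grid]  # work on a copy
    feedback: Feedback = []
    for turn in range(1, max_turns + 1):
        move = propose(grid, feedback)
        if move is None:
            return False, turn, grid
        error = verify_move(grid, move)
        if error is None:
            r, c, v = move
            grid[r][c] = v
            feedback = []  # accepted: start fresh for the next cell
        else:
            feedback.append((move, error))  # rejected: tell the proposer why
        if all(all(row) for row in grid):
            return True, turn, grid
    return False, max_turns, grid


PUZZLE: Grid = [
    [1, 0, 3, 4],
    [3, 4, 0, 2],
    [0, 1, 4, 3],
    [4, 3, 2, 0],
]

solved, turns, final = solve_agentically(PUZZLE, naive_proposer, max_turns=50)
print(solved, turns)  # True 6
```

Two of the six turns here are rejected proposals, which is the point of the design: every turn counts against the budget, so a proposer that learns from verifier feedback finishes in fewer turns than one that does not.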

The results revealed stark disparities in reasoning capability. US-based closed models dominated the leaderboard, with three models exceeding 33% solve rates, while the top Chinese open-source model achieved only 6%. Extended reasoning modes (@medium, @high, @xhigh) dramatically improved performance but sometimes caused infrastructure failures. Cost efficiency varied enormously: xAI's Grok 4.1 Fast Reasoning solved puzzles for as little as $0.00033 per success, while Claude Sonnet 4.6 with 1M context cost up to $238.16 per successful solve.

The researchers have made the full dataset, interactive puzzle player, and step-by-step AI solution replays publicly available, providing transparency into how models approach these logic problems. The benchmark tests verifiable reasoning, where correctness can be objectively determined, unlike many subjective AI evaluation tasks.

  • Cost per success varies by roughly 720,000x between the most and least efficient models, from $0.00033 to $238.16
  • The 62,000-puzzle benchmark with 94 puzzle types provides objective, verifiable testing of multi-step reasoning with intermediate step validation
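
The cost spread quoted above can be checked with a few lines of arithmetic. The two dollar figures come from the article; the `cost_per_success` helper reflects an assumed definition (total spend divided by number of solved puzzles), since the article does not spell out how the metric is computed.

```python
def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Assumed definition: total spend divided by the number of solved puzzles."""
    if successes <= 0:
        raise ValueError("cost per success is undefined with no successes")
    return total_cost_usd / successes


# Figures reported in the article, in USD per successful solve.
CHEAPEST = 0.00033        # xAI Grok 4.1 Fast Reasoning
MOST_EXPENSIVE = 238.16   # Claude Sonnet 4.6 with 1M context

ratio = MOST_EXPENSIVE / CHEAPEST
print(f"{ratio:,.0f}x")  # 721,697x
```

The exact ratio is about 721,700x, which the article rounds to 720,000x.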

Editorial Opinion

This benchmark fills a critical gap in AI evaluation by providing objectively verifiable reasoning tasks where partial progress can be tracked at every step. That roughly half the puzzles remain unsolved by any frontier model, even with agentic iteration that ran to about 1,200 turns in the longest attempt, suggests these logic puzzles capture reasoning challenges that current architectures fundamentally struggle with. The enormous variance in cost per success across models raises important questions about the economics of reasoning-heavy applications, particularly as extended reasoning modes run into infrastructure limits.

Large Language Models (LLMs) · AI Agents · Machine Learning · Science & Research · Open Source
