Blueprint Bench: First Signs of 3D Spatial Intelligence in LLMs
Key Takeaways
- OpenAI's GPT 5.5 achieves the highest spatial reasoning score (0.36 connectivity similarity) on Blueprint-Bench 2, with Google's Gemini 3.1 Pro and Anthropic's Claude Opus 4.7 in close pursuit
- Room-to-room connectivity inference is the key performance discriminator: lower-performing models fail here despite ~90% accuracy on basic room counting
- Specialized robotics-focused models underperformed, suggesting general-purpose models may have superior spatial reasoning capabilities
Summary
Blueprint-Bench 2, a new benchmark for evaluating 3D spatial reasoning in large language models, reveals that leading AI models are demonstrating genuine spatial intelligence when tasked with converting apartment photographs into accurate 2D floor plans. The benchmark tests models on their ability to identify rooms, infer spatial relationships, understand scale, and generate structured output—a task that requires far more than memorized training data.
According to the latest results, OpenAI's GPT 5.5 leads the field with a connectivity similarity score of 0.36, followed by Google's Gemini 3.1 Pro (0.27) and Anthropic's Claude Opus 4.7 (0.25). The benchmark includes a persistent notepad feature that allows agents to record strategies and lessons learned across 50 sequential apartment evaluations, enabling iterative learning and pattern recognition. The key discriminator between high and low performers is the ability to correctly infer room-to-room connectivity—most models achieve ~90% accuracy on counting rooms but struggle to map which rooms connect to which.
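The article doesn't specify how connectivity similarity is computed, but a natural formulation treats the floor plan as a room-adjacency graph and compares predicted versus ground-truth edges. As an illustrative sketch (the metric name and edge representation here are assumptions, not the benchmark's actual definition), Jaccard similarity over undirected room-to-room connections captures the idea:

```python
def connectivity_similarity(predicted, truth):
    """Jaccard similarity between two sets of room-to-room connections.

    Each argument is an iterable of (room_a, room_b) pairs. Edges are
    treated as undirected, so ("hall", "kitchen") and ("kitchen", "hall")
    count as the same connection.
    """
    norm = lambda edges: {frozenset(e) for e in edges}
    p, t = norm(predicted), norm(truth)
    if not p and not t:
        return 1.0  # both plans empty: trivially identical
    return len(p & t) / len(p | t)

# A model can count rooms correctly yet still score poorly here:
truth = [("hall", "kitchen"), ("hall", "bedroom"), ("hall", "bathroom")]
predicted = [("hall", "kitchen"), ("kitchen", "bedroom"), ("hall", "bathroom")]
print(connectivity_similarity(predicted, truth))  # 0.5
```

A metric like this penalizes exactly the failure mode described: the predicted plan has the right rooms and the right number of doors, but wires one of them between the wrong pair of rooms.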
A notable finding is that specialized models fell short of expectations. Google's Gemini Robotics-ER 1.6, despite being designed specifically for spatial and embodied reasoning, scored below the general-purpose Gemini 3 Flash. The benchmark demonstrates what researchers call 'sparks of spatial reasoning,' with top models successfully reversing camera direction using visual landmarks and inferring through-room connectivity from multiple doorways—capabilities entirely absent from the original Blueprint-Bench.
- Persistent learning across sequential tasks improves model performance, demonstrating the value of maintaining context and extracting patterns across multiple spatial reasoning problems
- This marks the first evidence of genuine 3D spatial intelligence in LLMs, a significant jump from the 'essentially noise' outputs of the original Blueprint-Bench
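The persistent-notepad mechanism described above can be sketched as a simple loop: notes written after each apartment are fed back in on the next one. This is a hypothetical reconstruction under stated assumptions; the function names (`agent`, `score_plan`) and the plain-list notepad are placeholders, not the benchmark's actual API:

```python
def run_benchmark(apartments, agent, score_plan):
    """Evaluate an agent sequentially, carrying a notepad across tasks.

    `agent(apartment, notes)` returns a (floor_plan, lesson) pair;
    `score_plan(floor_plan, apartment)` returns a similarity score.
    The notepad persists for all 50 (or however many) evaluations,
    so later apartments benefit from earlier lessons.
    """
    notepad, scores = [], []
    for apt in apartments:
        plan, lesson = agent(apt, notes="\n".join(notepad))
        scores.append(score_plan(plan, apt))
        notepad.append(lesson)  # e.g. "count doorways before placing walls"
    return scores

# Toy usage with stand-in callables, just to show the data flow:
toy_agent = lambda apt, notes: (apt.upper(), f"lesson from {apt}")
toy_score = lambda plan, apt: 1.0 if plan == apt.upper() else 0.0
print(run_benchmark(["apt-1", "apt-2"], toy_agent, toy_score))  # [1.0, 1.0]
```

The design point is that learning happens in context, not in weights: the model's improvement across the sequence comes entirely from the text it chose to write down.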
Editorial Opinion
Blueprint-Bench 2 is an important milestone in understanding LLM capabilities beyond language. The emergence of genuine spatial reasoning in leading models suggests AI systems are developing more sophisticated forms of intelligence that blend multiple modalities and reasoning types. However, the performance gap between top models and the field—and especially the underperformance of task-specialized models—raises questions about what architectural features and training approaches best develop spatial reasoning. As benchmarks like this become standard, we'll see whether spatial intelligence is fundamental to all capable AI systems or remains a rare trait.