RegexPSPACE: New Benchmark Exposes LLM Limitations in Space-Bounded Reasoning
Key Takeaways
- Current LLMs and LRMs exhibit significant limitations on PSPACE-complete problems, a higher complexity class than previously benchmarked
- All tested models showed common failure patterns, including excessive verbosity and repetitive reasoning steps
- First empirical framework to systematically evaluate the space-complexity limits of modern language and reasoning models
Summary
Researchers have introduced RegexPSPACE, a benchmark designed to rigorously evaluate the computational limits of large language models (LLMs) and large reasoning models (LRMs) on PSPACE-complete problems. The benchmark centers on two challenging regular expression tasks—equivalence decision and minimization—that demand extensive search-space exploration, pushing beyond the NP-class complexity of typical evaluations. Testing across 6 LLMs and 5 LRMs of varying scales revealed consistent failure patterns, including excessive verbosity and repetitive reasoning, highlighting significant gaps in the models' capacity for space-bounded computation.
The researchers constructed over a million labeled regex instances using a double-exponential space exploration method, establishing the first empirical investigation into the space-complexity limits of modern LLMs and LRMs. The work provides a quantitatively rigorous framework for assessing advanced reasoning capabilities and complements the growing focus on explicit reasoning in large models. The benchmark and code are publicly available, offering the AI research community a new tool for stress-testing model reasoning under computationally demanding conditions.
The open-source benchmark and million-instance dataset enable ongoing research into model reasoning capabilities.
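To make the equivalence task concrete, here is a minimal sketch (not the paper's method) of why regex equivalence is hard to brute-force: the only cheap check available is testing the two expressions against strings up to a bounded length, which can miss distinguishing strings beyond the bound. True equivalence is PSPACE-complete because it may require exploring automata whose state spaces are exponential in the regex size. The function name, alphabet, and length bound below are illustrative choices, using only Python's standard `re` module.

```python
import re
from itertools import product

def bounded_equiv(r1: str, r2: str, alphabet: str = "ab", max_len: int = 6) -> bool:
    """Check whether two regexes accept the same strings up to max_len.

    This is only a bounded approximation: deciding full equivalence is
    PSPACE-complete, since the underlying automata can have state spaces
    exponential in the size of the expressions.
    """
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            # A single string matched by one regex but not the other
            # witnesses inequivalence.
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False
    return True

# (a|b)* and (a*b*)* denote the same language over {a, b}
print(bounded_equiv(r"(a|b)*", r"(a*b*)*"))  # True
# a* and (a|b)* differ already on the string "b"
print(bounded_equiv(r"a*", r"(a|b)*"))       # False
```

The minimization task is harder still in practice: it asks for a smallest expression denoting the same language, which subsumes equivalence checking as a subroutine.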
Editorial Opinion
RegexPSPACE arrives at a critical moment when LLMs and reasoning models are advancing rapidly, yet their fundamental computational constraints remain underexplored. By introducing PSPACE-complete problems—a genuine increase in rigor over NP-class benchmarks—this research clarifies that reasoning models, despite impressive capabilities, still struggle with problems requiring massive search space exploration. This work is essential for the field's understanding of where current architectures hit their ceiling and should inform the design of next-generation models.