First Proof Round 2: Mathematicians Benchmark AI's Pure Mathematics Capabilities as LLMs Solve Complex Lemmas
Key Takeaways
- OpenAI's and Google DeepMind's LLMs solved five to six of the 10 challenging mathematical lemmas in First Proof's inaugural round, exceeding expert expectations
- Each AI model showed complementary capabilities, solving problems the other could not, which suggests different architectural approaches may suit different mathematical problems
- The second round of First Proof will impose stricter transparency and access requirements, signaling the mathematics community's commitment to rigorous, open benchmarking of AI capabilities
Summary
The First Proof initiative, a benchmarking effort to assess large language models' ability to contribute to research-level mathematics, has announced a second round with new requirements for transparency and access from participating AI companies. In the first round, results exceeded expectations: OpenAI's models solved at least 5 of 10 proposed lemmas from unpublished papers by Harvard mathematician Lauren Williams and colleagues, while Google DeepMind's Aletheia agent solved approximately 6 problems, with each model demonstrating unique strengths the other lacked.
The initiative emerged from the First Proof team's recognition that existing benchmarks were insufficient for evaluating LLMs as mathematical research assistants. Rather than targeting major theorems, the benchmark asks whether AI can efficiently prove smaller "lemmas": the intermediate propositions mathematicians use as building blocks toward larger discoveries. The strong performance has surprised even skeptical observers; mathematician Daniel Litt notes that as many as 8 of the 10 problems were at least partially solved by AI, demonstrating rapid capability improvements.
While some mathematicians worry about AI's impact on their field, others remain optimistic. Litt expects AI tools will enhance rather than replace mathematical research, enabling mathematicians to tackle their most ambitious work. The second round's transparency requirements suggest the field is moving toward more rigorous, open evaluation of AI's mathematical abilities—a critical step as these systems increasingly contribute to legitimate research.
Editorial Opinion
First Proof represents an important inflection point in assessing AI's genuine contribution to human knowledge production rather than a mere capability showcase. The fact that different models solved complementary problems suggests we are not yet seeing a dominant "winner" in mathematical AI, a healthy state that encourages continued competition and innovation. The mathematics community's insistence on transparency and access for round two is equally essential: benchmarking AI on real, unpublished lemmas from active researchers is far more meaningful than abstract test sets, and it sets a standard other fields should emulate.