First Proof Round One: LLMs Successfully Solve Research-Level Math Problems, Surprising Experts
Key Takeaways
- OpenAI's and Google DeepMind's LLMs solved five and six, respectively, of the 10 research-level math problems in First Proof round one, exceeding expert expectations
- Each model solved problems the other couldn't, indicating diverse and complementary mathematical capabilities
- The results mark a pivotal moment, showing LLMs can contribute meaningfully to pure mathematics research through proof generation
Summary
First Proof, a benchmarking initiative designed to evaluate large language models' ability to contribute to pure mathematics research, has completed its inaugural round with surprising results. The test presented 10 lemmas from unpublished mathematical papers to AI companies, with a one-week deadline for solving them. OpenAI's model correctly solved five problems, while Google DeepMind's Aletheia agent solved six (though experts debate the validity of one), demonstrating that current LLMs can generate valid proofs for intermediate mathematical propositions useful to working mathematicians.
The results exceeded expectations among leading mathematicians, with up to eight of the ten problems appearing to have been at least partially solved by AI. Notably, each model solved problems the other couldn't, revealing complementary capabilities. The First Proof team, led by Harvard mathematician Lauren Williams, has announced plans for a second round requiring participating AI companies to provide access and transparency. The benchmarking effort addresses a critical gap: existing metrics were insufficient for evaluating LLMs as mathematical assistants, where the ability to prove smaller lemmas could save researchers significant time in developing larger theorems.
Editorial Opinion
The First Proof results represent a watershed moment for AI's integration into mathematical research, demonstrating that LLMs have moved beyond toy problems to tackle genuine research-level challenges. However, the surprising complementarity of the models' capabilities suggests the field is still in its early stages: no single approach yet dominates. With mathematicians like Daniel Litt optimistically framing AI as a collaborative tool rather than a replacement, the emphasis on transparency and rigorous benchmarking in round two will be crucial for building trust and for distinguishing where AI genuinely accelerates discovery from where it merely produces plausible-sounding but incorrect proofs.