First Proof Round One: LLMs Successfully Solve Research-Level Math Problems, Surprising Experts
Key Takeaways
- OpenAI's and Google DeepMind's LLMs solved five and six, respectively, of the 10 research-level math problems in First Proof round one, exceeding expert expectations
- Each model solved problems the other couldn't, indicating diverse and complementary mathematical capabilities
- The results mark a pivotal moment, showing LLMs can contribute meaningfully to pure mathematics research through proof generation
Summary
First Proof, a benchmarking initiative designed to evaluate large language models' ability to contribute to pure mathematics research, has completed its inaugural round with surprising results. The test presented 10 lemmas from unpublished mathematical papers to AI companies, with a one-week deadline for solving them. OpenAI's model correctly solved five problems, while Google DeepMind's Aletheia agent solved six (though experts debate the validity of one), demonstrating that current LLMs can generate valid proofs for intermediate mathematical propositions useful to working mathematicians.
The results exceeded expectations among leading mathematicians, with up to eight of the ten problems appearing to have been at least partially solved by AI. Notably, each model solved problems the other couldn't, revealing complementary capabilities. The First Proof team, led by Harvard mathematician Lauren Williams, has announced plans for a second round requiring participating AI companies to provide access and transparency. The benchmarking effort addresses a critical gap: existing metrics were insufficient for evaluating LLMs as mathematical assistants, where the ability to prove smaller lemmas could save researchers significant time in developing larger theorems.
Editorial Opinion
The First Proof results represent a watershed moment for AI's integration into mathematical research, demonstrating that LLMs have moved beyond toy problems to tackle genuine research-level challenges. However, the surprising complementarity of the models' capabilities suggests the field is still in its early stages: no single approach yet dominates. With mathematicians like Daniel Litt optimistically framing AI as a collaborative tool rather than a replacement, the emphasis on transparency and rigorous benchmarking in round two will be crucial for building trust and for distinguishing where AI genuinely accelerates discovery from where it merely produces plausible-sounding but incorrect proofs.