Nature Publishes HLE Benchmark: Expert-Level Academic Questions Expose Gaps in Frontier AI Capabilities
Key Takeaways
- HLE contains 2,500 expert-level academic questions across dozens of subjects, created by domain experts to challenge frontier AI models
- State-of-the-art LLMs show significantly lower accuracy on HLE than on saturated benchmarks like MMLU, where they exceed 90% accuracy
- The benchmark is multi-modal, with both text and image-based questions, and uses multiple-choice and exact-match formats for automated grading
Summary
The Center for AI Safety, in collaboration with Scale AI and a consortium of academic experts, has published Humanity's Last Exam (HLE) in Nature, introducing a new benchmark designed to assess AI capabilities at the frontier of human knowledge. The benchmark consists of 2,500 multi-modal questions spanning mathematics, humanities, and natural sciences, created by subject-matter experts to challenge even the most advanced large language models.
HLE addresses a critical gap in AI evaluation: existing popular benchmarks like MMLU have become saturated, with state-of-the-art LLMs now achieving over 90% accuracy, making it difficult to meaningfully measure continued progress. In contrast, current frontier models demonstrate significantly lower accuracy on HLE, revealing substantial gaps between AI capabilities and expert-level human performance on closed-ended academic questions.
The benchmark features both multiple-choice and short-answer questions with unambiguous, verifiable answers that cannot be found through quick internet retrieval. Questions are original and place particular emphasis on world-class mathematics problems that test deep reasoning skills applicable across academic domains. The benchmark has been made publicly available at lastexam.ai to give AI research and policymaking a clearer understanding of actual model capabilities.
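As an illustration of the automated-grading setup described above, the sketch below scores model outputs against reference answers: multiple-choice responses by option letter and short-answer responses by normalized exact match. The field names (`answer_type`, `answer`) and the simple string normalization are assumptions made for illustration, not the benchmark's official schema or grading code.

```python
# Hypothetical grading sketch; field names and normalization rules are
# illustrative assumptions, not HLE's official evaluation code.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods so
    trivial formatting differences are not counted as errors."""
    return text.strip().rstrip(".").strip().lower()

def grade(question: dict, model_output: str) -> bool:
    """Return True if the model output matches the reference answer."""
    if question["answer_type"] == "multiple_choice":
        # Multiple-choice items are scored by the selected option letter.
        return normalize(model_output) == normalize(question["answer"])
    # Short-answer items are scored by normalized exact match.
    return normalize(model_output) == normalize(question["answer"])

# Example usage with made-up records in the assumed schema.
questions = [
    {"answer_type": "multiple_choice", "answer": "B"},
    {"answer_type": "exact_match", "answer": "17"},
]
model_outputs = ["b", "17."]

accuracy = sum(
    grade(q, out) for q, out in zip(questions, model_outputs)
) / len(questions)
print(f"Accuracy: {accuracy:.1%}")  # -> Accuracy: 100.0%
```

Exact-match scoring of this kind trades some tolerance for equivalent answer formats in exchange for evaluation that is fully automated and reproducible, which is what makes closed-ended benchmarks like this practical to run at scale.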
Editorial Opinion
The publication of HLE in Nature represents a crucial milestone in AI evaluation methodology. As frontier models have rapidly saturated existing benchmarks, the research community has struggled to measure progress accurately and identify remaining capability gaps, a problem that becomes especially pressing as AI systems are deployed in high-stakes applications. By establishing a benchmark at the true frontier of human expert knowledge, HLE gives researchers and policymakers an essential tool for understanding where today's AI actually stands relative to human expertise, rather than relying on inflated scores from easier benchmarks that no longer discriminate between models.



