Nature Publishes HLE Benchmark: Expert-Level Academic Questions Expose Gaps in Frontier AI Capabilities
Key Takeaways
- HLE contains 2,500 expert-level academic questions across dozens of subjects, created by domain experts to challenge frontier AI models
- State-of-the-art LLMs show significantly lower accuracy on HLE than on saturated benchmarks like MMLU, where they exceed 90% accuracy
- The benchmark is multi-modal, with both text and image-based questions, and uses multiple-choice and exact-match formats for automated grading
Summary
The Center for AI Safety, in collaboration with Scale AI and a consortium of academic experts, has published Humanity's Last Exam (HLE) in Nature, introducing a new benchmark designed to assess AI capabilities at the frontier of human knowledge. The benchmark consists of 2,500 multi-modal questions spanning mathematics, humanities, and natural sciences, created by subject-matter experts to challenge even the most advanced large language models.
HLE addresses a critical gap in AI evaluation: existing popular benchmarks like MMLU have become saturated, with state-of-the-art LLMs now achieving over 90% accuracy, making it difficult to meaningfully measure continued progress. In contrast, current frontier models demonstrate significantly lower accuracy on HLE, revealing substantial gaps between AI capabilities and expert-level human performance on closed-ended academic questions.
The benchmark features both multiple-choice and short-answer questions with unambiguous, verifiable answers that cannot be found through quick internet retrieval. Questions are original and place particular emphasis on world-class mathematics problems that test deep reasoning skills applicable across academic domains. The benchmark has been made publicly available at lastexam.ai to give AI research and policymaking a clearer understanding of actual model capabilities.
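As an illustration of the automated-grading setup described above, the sketch below scores model outputs against reference answers: multiple-choice responses by option letter and short-answer responses by normalized exact match. The field names (`answer_type`, `answer`) and the simple string normalization are assumptions made for illustration, not the benchmark's official schema or grading code.

```python
# Hypothetical grading sketch; field names and normalization rules are
# illustrative assumptions, not HLE's official evaluation code.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods so
    trivial formatting differences are not counted as errors."""
    return text.strip().rstrip(".").strip().lower()

def grade(question: dict, model_output: str) -> bool:
    """Return True if the model output matches the reference answer."""
    if question["answer_type"] == "multiple_choice":
        # Multiple-choice items are scored by the selected option letter.
        return normalize(model_output) == normalize(question["answer"])
    # Short-answer items are scored by normalized exact match.
    return normalize(model_output) == normalize(question["answer"])

# Example usage with made-up records in the assumed schema.
questions = [
    {"answer_type": "multiple_choice", "answer": "B"},
    {"answer_type": "exact_match", "answer": "17"},
]
model_outputs = ["b", "17."]

accuracy = sum(
    grade(q, out) for q, out in zip(questions, model_outputs)
) / len(questions)
print(f"Accuracy: {accuracy:.1%}")  # -> Accuracy: 100.0%
```

Exact-match scoring of this kind trades some tolerance for equivalent answer formats in exchange for evaluation that is fully automated and reproducible, which is what makes closed-ended benchmarks like this practical to run at scale.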
Editorial Opinion
The publication of HLE in Nature represents a crucial milestone in AI evaluation methodology. As frontier models have rapidly saturated existing benchmarks, the research community has struggled to measure progress accurately and identify remaining capability gaps, a problem that becomes especially pressing as AI systems are deployed in high-stakes applications. By establishing a benchmark at the true frontier of human expert knowledge, HLE gives researchers and policymakers an essential tool for understanding where today's AI actually stands relative to human expertise, rather than relying on inflated scores from easier benchmarks that no longer discriminate between models.



