BotBeat

Center for AI Safety
RESEARCH · 2026-02-27

Nature Publishes HLE Benchmark: Expert-Level Academic Questions Expose Gaps in Frontier AI Capabilities

Key Takeaways

  • HLE contains 2,500 expert-level academic questions across dozens of subjects, created by domain experts to challenge frontier AI models
  • State-of-the-art LLMs score far lower on HLE than on saturated benchmarks like MMLU, where they exceed 90% accuracy
  • The benchmark is multi-modal, with text and image-based questions in multiple-choice and exact-match formats that allow automated grading
Source: Hacker News (https://www.nature.com/articles/s41586-025-09962-4)

Summary

The Center for AI Safety, in collaboration with Scale AI and a consortium of academic experts, has published Humanity's Last Exam (HLE) in Nature, introducing a new benchmark designed to assess AI capabilities at the frontier of human knowledge. The benchmark consists of 2,500 multi-modal questions spanning mathematics, humanities, and natural sciences, created by subject-matter experts to challenge even the most advanced large language models.

HLE addresses a critical gap in AI evaluation: existing popular benchmarks like MMLU have become saturated, with state-of-the-art LLMs now achieving over 90% accuracy, making it difficult to meaningfully measure continued progress. In contrast, current frontier models demonstrate significantly lower accuracy on HLE, revealing substantial gaps between AI capabilities and expert-level human performance on closed-ended academic questions.

The benchmark features both multiple-choice and short-answer questions with unambiguous, verifiable solutions that cannot be quickly answered through simple internet retrieval. Questions are original and emphasize world-class mathematics problems designed to test deep reasoning skills applicable across multiple academic domains. The benchmark has been made publicly available at lastexam.ai to inform AI research and policymaking with clearer understanding of model capabilities.

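The exact-match format described above is what makes fully automated grading possible: because each question has an unambiguous, verifiable answer, scoring reduces to normalized string comparison. The following is a minimal sketch of such a grader; the field names and normalization rule are illustrative assumptions, not HLE's published evaluation harness.

```python
# Minimal sketch of automated grading for a benchmark mixing
# multiple-choice and short exact-match answers, as the article
# describes HLE's formats. The dict layout and normalization rule
# are illustrative assumptions, not HLE's actual harness.

def normalize(answer: str) -> str:
    """Case-fold and trim so trivially different strings still match."""
    return answer.strip().lower()

def grade(gold: str, predicted: str) -> bool:
    """A question counts as correct only on an exact match after normalization."""
    return normalize(predicted) == normalize(gold)

def accuracy(items: list[dict], predictions: list[str]) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(grade(item["answer"], p) for item, p in zip(items, predictions))
    return correct / len(items)

# Example: one multiple-choice item (gold is the option letter)
# and one exact-match item.
items = [{"answer": "B"}, {"answer": "e^2 - 1"}]
predictions = ["b", "e^2 + 1"]
print(accuracy(items, predictions))  # 0.5
```

Strict exact matching is deliberately unforgiving; real harnesses typically add answer-specific normalization (e.g. for numeric or LaTeX answers), which is why benchmark authors favor question formats with a single canonical answer string.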

Editorial Opinion

The publication of HLE in Nature represents a crucial milestone in AI evaluation methodology. As frontier models have rapidly saturated existing benchmarks, the research community has struggled to accurately measure progress and identify remaining capability gaps—a problem that becomes especially critical as AI systems are deployed in high-stakes applications. By establishing a benchmark at the true frontier of human expert knowledge, HLE provides researchers and policymakers with an essential tool for understanding where today's AI actually stands relative to human expertise, rather than relying on inflated scores from easier benchmarks that no longer discriminate between models.

Tags: Large Language Models (LLMs) · Machine Learning · Data Science & Analytics · Science & Research · AI Safety & Alignment

