BotBeat

OpenAI
RESEARCH · 2026-06-05

Study: GPT-4o, Llama 3, Command R+ All Struggle to Assist Non-Experts in Medical Scenarios

Key Takeaways

  • When tested alone, the LLMs identified relevant conditions in 94.9% of cases, but participants using the same models did so in fewer than 34.5% of cases, no better than the control group
  • Standard medical knowledge benchmarks do not predict interactive effectiveness or usability with real human participants, revealing a critical blind spot in AI evaluation methodology
  • Study recommends systematic human user testing with diverse, non-expert populations before deploying LLMs for public healthcare applications
Source: Hacker News, https://www.nature.com/articles/s41591-025-04074-y

Summary

A randomized controlled study with 1,298 participants has revealed a stark gap between how well large language models perform on medical knowledge exams and how effectively they help the general public identify medical conditions and choose appropriate care. When tested alone, GPT-4o, Llama 3, and Command R+ identified relevant medical conditions in 94.9% of cases and recommended an appropriate course of action in 56.3% of cases on average. However, when participants used these same LLMs to work through ten medical scenarios, performance dropped sharply.

Participants using the LLMs correctly identified relevant conditions in fewer than 34.5% of cases and chose an appropriate disposition (course of action) in fewer than 44.2%, performing no better than the control group, which received no LLM assistance. The study points to user interactions and interface design as the likely culprits, and finds that standard medical knowledge benchmarks and simulated patient interactions do not predict real-world performance when non-expert users interact with the systems.
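To make the size of the drop concrete, the figures quoted in this summary can be compared directly. The following is a minimal sketch, not the study's own analysis: the dictionary keys and the function name `minimum_gap_points` are illustrative assumptions, and because the with-user figures are reported only as upper bounds ("fewer than"), each computed gap is a lower bound.

```python
# Figures quoted in the article, in percent.
# The with-user values are reported as upper bounds ("fewer than"),
# so each gap computed below is a lower bound on the true gap.
standalone = {"condition_identification": 94.9, "disposition": 56.3}
with_users_upper_bound = {"condition_identification": 34.5, "disposition": 44.2}


def minimum_gap_points(alone, assisted_upper):
    """Return the smallest possible gap per task, in percentage points."""
    return {task: round(alone[task] - assisted_upper[task], 1) for task in alone}


gaps = minimum_gap_points(standalone, with_users_upper_bound)
for task, gap in gaps.items():
    print(f"{task}: at least {gap} percentage points")
```

On these numbers the gap is at least 60.4 percentage points for condition identification and at least 12.1 points for choosing a disposition.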

The authors strongly recommend that healthcare providers and AI developers conduct systematic human user testing with diverse populations before deploying LLMs for medical advice to the general public. The research suggests that the gap between benchmark performance and real-world effectiveness is a critical barrier to safe deployment of AI systems in healthcare settings.

Editorial Opinion

This research delivers an important cautionary tale for the AI industry: benchmark performance is not a reliable predictor of real-world safety and effectiveness, especially in high-stakes domains like healthcare. While GPT-4o and other leading LLMs identified conditions with near-perfect accuracy when tested alone, their performance collapsed when real users attempted to interact with them for medical guidance. The gap, from 94.9% accuracy to below 34.5%, underscores the importance of human-centered testing before deploying AI systems in healthcare. Healthcare providers and AI companies must resist the temptation to rely solely on benchmark scores and instead invest in systematic user testing with diverse populations.

Large Language Models (LLMs) · Natural Language Processing (NLP) · Healthcare · AI Safety & Alignment
