Study: GPT-4o, Llama 3, Command R+ All Struggle to Assist Non-Experts in Medical Scenarios
Key Takeaways
- When tested alone, the LLMs identified relevant medical conditions in 94.9% of cases, but non-expert participants using the same models did so in fewer than 34.5% of cases, no better than the control group
- Standard medical knowledge benchmarks do not predict interactive effectiveness or usability with real human participants, revealing a critical blind spot in AI evaluation methodology
- Study recommends mandatory human user testing with diverse, non-expert populations before deploying LLMs for public healthcare applications
Summary
A randomized controlled study with 1,298 participants has revealed a stark gap between how well large language models perform on medical knowledge exams and how effectively they actually help the general public identify medical conditions and choose appropriate care. When tested alone, GPT-4o, Llama 3, and Command R+ achieved 94.9% accuracy in identifying relevant medical conditions and 56.3% accuracy on average in recommending appropriate courses of action. However, when actual participants used these same LLMs to work through ten medical scenarios, performance dropped sharply.
Participants using the LLMs correctly identified relevant conditions in fewer than 34.5% of cases and chose an appropriate disposition in fewer than 44.2%, performing no better than the control group that received no LLM assistance. The study identifies user interactions and interface design as the likely culprits, finding that standard medical knowledge benchmarks and simulated patient interactions are not predictive of real-world performance when non-expert users interact with the systems.
The authors strongly recommend that healthcare providers and AI developers conduct systematic human user testing with diverse populations before deploying LLMs for medical advice to the general public. The research suggests that the gap between benchmark performance and real-world effectiveness is a critical barrier to safe deployment of AI systems in healthcare settings.
Editorial Opinion
This research delivers an important cautionary tale for the AI industry: benchmark performance is not a reliable predictor of real-world safety and effectiveness, especially in high-stakes domains like healthcare. While GPT-4o and the other LLMs identified relevant conditions almost perfectly when tested alone, their performance collapsed when real users attempted to interact with them for medical guidance. The gap, from 94.9% accuracy for the models alone to under 34.5% for participants using them, underscores the critical importance of human-centered testing before deploying AI systems in healthcare. Healthcare providers and AI companies must resist the temptation to rely solely on benchmark scores and instead invest in systematic user testing with diverse populations.



