Study: Leading LLMs Fail in 80% of Early Differential Diagnosis Cases, Raising Patient Safety Concerns
Key Takeaways
- LLMs fail to correctly generate early differential diagnoses in over 80% of cases tested, despite 91% accuracy on final diagnosis with complete information
- Early differential diagnosis—where clinicians navigate uncertainty—is the clinical stage where AI systems show their greatest weakness, not strength
- Marketing LLMs as frontline diagnostic agents risks creating false patient confidence and could lead to delayed care, unnecessary procedures, and adverse outcomes
Summary
A new study published in JAMA Network Open has found that leading large language models fail to correctly perform early differential diagnosis in more than 8 out of 10 cases, despite achieving 91% accuracy when provided complete medical information for final diagnosis. Led by Harvard medical student Arya Rao, researchers tested 21 off-the-shelf AI models on 29 standardized clinical vignettes, revealing critical weaknesses precisely where clinical uncertainty matters most.
The research highlights a dangerous gap between LLM performance on final diagnosis and on the early reasoning stage, where clinicians must weigh multiple possibilities and rule out conditions. While some models achieved 63-78% raw accuracy when partially correct answers were counted, the stricter failure metric underscores that LLMs cannot reliably handle the ambiguous decision-making that characterizes real clinical work. The researchers warn that marketing these systems as diagnostic agents creates false confidence in precisely the areas where they are least reliable.
Dr. Marc Succi, a radiologist at Massachusetts General Hospital and a coauthor of the paper, cautioned that LLMs can "project confidence without showing robust reasoning" and that higher success rates on final diagnosis can create a misleading sense of safety. The team argues that real clinical reasoning begins at the early differential stage, where ambiguity is greatest and current LLMs are weakest, and that flawed early reasoning can lead to delayed care, unnecessary procedures, and significant patient harm.
Their conclusion: current off-the-shelf LLMs should not be trusted for patient-facing diagnostic reasoning without comprehensive human review and structured oversight.
Editorial Opinion
This research delivers an important reality check for the hype surrounding AI in clinical medicine. While LLMs excel at synthesis when given complete information, they fundamentally fail at the reasoning under uncertainty that defines real medical practice. The concerning gap between final diagnosis accuracy and early differential performance suggests that deploying these systems at the frontline of patient care—where they're increasingly marketed—could genuinely harm patients by providing confident-sounding but unreliable guidance. Until LLMs demonstrate robust performance in navigating diagnostic ambiguity, their role must remain strictly advisory and subordinate to human clinical judgment.

