Penn State Study: Large Language Models Achieve 76% Accuracy on Healthcare Queries, Raising Patient Safety Concerns
Key Takeaways
- LLMs achieved 76.2% accuracy on healthcare queries, but accuracy varies dramatically by medical specialty, from high performance in OB/GYN to poor performance in neurology and dermatology
- Nearly a quarter of AI-generated healthcare responses contained errors or potential for harm, with internal medicine, neurology, and dermatology showing the worst outcomes
- Prompt specificity and length (60-250 characters) significantly impact LLM accuracy in healthcare contexts
Summary
A new study by Penn State researchers found that large language models answer everyday health-related questions with approximately 76% accuracy, raising significant concerns about their reliability for patient self-diagnosis. The researchers ran a "Diagnose-a-thon" competition in which 34 participants submitted 212 prompts, along with the responses generated by ChatGPT-4o, ChatGPT-3.5, Google Gemini-1.5 Pro, and Meta's Llama3-8b. Nine board-certified physicians then rated each response for accuracy and potential for harm on a six-point scale.
The study revealed dramatic variations in AI performance across medical specialties. Obstetrics and gynecology and otolaryngology (ear, nose, and throat conditions) achieved the best results, with high validity and low harm scores, while internal medicine, neurology, and dermatology performed poorly, with lower validity scores and higher potential for harm. The researchers also found that more specific prompts, and prompts between 60 and 250 characters long, generated significantly more accurate outputs.
The findings suggest that while AI chatbots show promise as supportive tools for trained physicians, they remain too unreliable for routine patient self-diagnosis. The researchers emphasize that healthcare AI tools may be better suited for use by medical professionals who can validate and contextualize the AI's responses rather than by the general public. The team will present their findings at the ACM FAccT (Fairness, Accountability and Transparency) conference in Montreal on June 25-28, 2026.
- Researchers recommend AI healthcare tools be deployed primarily as physician aids rather than patient-facing diagnostic systems
Editorial Opinion
While 76% accuracy might sound reasonable in isolation, it is fundamentally inadequate for healthcare, where diagnostic errors directly impact patient safety. The sharp performance drop in critical specialties like neurology and dermatology is particularly alarming: patients seeking answers about neurological symptoms or skin conditions face substantially higher misdiagnosis risks. This research validates the intuitive concern that general-purpose LLMs lack the specialized medical reasoning needed for reliable patient care, though the work appropriately points toward physician-supervised use cases where human expertise can validate and correct AI suggestions.


