Penn State Study: Large Language Models Achieve 76% Accuracy on Healthcare Queries, Raising Patient Safety Concerns

Key Takeaways

▸LLMs achieved 76.2% accuracy on healthcare queries, but accuracy varies dramatically by medical specialty, from high performance in OB/GYN to poor performance in neurology and dermatology
▸Nearly a quarter of AI-generated healthcare responses contained errors or potential for harm, with internal medicine, neurology, and dermatology showing the worst outcomes
▸Prompt specificity and length (60-250 characters) significantly impact LLM accuracy in healthcare contexts

Source:

Hacker News: https://www.psu.edu/news/research/story/calling-doctor-gpt-ai-responses-healthcare-queries-are-nearly-76-accurate

Summary

A new study by Penn State researchers found that large language models respond to everyday health-related questions with approximately 76% accuracy, raising significant concerns about their reliability for patient self-diagnosis. The researchers ran a "Diagnose-a-thon" competition in which 34 participants submitted 212 prompts, together with the AI-generated responses, using ChatGPT-4o, ChatGPT-3.5, Google Gemini-1.5 Pro, and Meta's Llama3-8b. Nine board-certified physicians evaluated the responses on accuracy and potential for harm using a six-point scale.

The study revealed dramatic variations in AI performance across medical specialties. Obstetrics and gynecology and otolaryngology (ear, nose, and throat conditions) achieved the best results with high validity and low harm scores, while internal medicine, neurology, and dermatology performed poorly with lower validity scores and higher potential for harm. Researchers found that more specific prompts and prompts between 60-250 characters generated significantly more accurate outputs.

The findings suggest that while AI chatbots show promise as supportive tools for trained physicians, they remain too unreliable for routine patient self-diagnosis. The researchers emphasize that healthcare AI tools may be better suited for use by medical professionals who can validate and contextualize the AI's responses rather than by the general public. The team will present their findings at the ACM FAccT (Fairness, Accountability and Transparency) conference in Montreal on June 25-28, 2026.

Researchers recommend AI healthcare tools be deployed primarily as physician aids rather than patient-facing diagnostic systems

Editorial Opinion

While 76% accuracy might sound reasonable in isolation, it is fundamentally inadequate for healthcare, where diagnostic errors directly affect patient safety. The sharp performance drop in critical specialties like neurology and dermatology is particularly alarming: patients seeking answers about neurological symptoms or skin conditions face substantially higher misdiagnosis risks. This research supports the intuitive concern that general-purpose LLMs lack the specialized medical reasoning needed for reliable patient care, though the work appropriately points toward physician-supervised use cases where human expertise can validate and correct AI suggestions.
