BotBeat

RESEARCH | OpenAI | 2026-05-29

Penn State Study: Large Language Models Achieve 76% Accuracy on Healthcare Queries, Raising Patient Safety Concerns

Key Takeaways

  • LLMs achieved 76.2% accuracy on healthcare queries, but accuracy varies dramatically by medical specialty, from high performance in OB/GYN to poor performance in neurology and dermatology
  • Nearly a quarter of AI-generated healthcare responses contained errors or potential for harm, with internal medicine, neurology, and dermatology showing the worst outcomes
  • Prompt specificity and length (60 to 250 characters) significantly affect LLM accuracy in healthcare contexts

Source: Hacker News, https://www.psu.edu/news/research/story/calling-doctor-gpt-ai-responses-healthcare-queries-are-nearly-76-accurate

Summary

A new study by Penn State researchers found that large language models respond to everyday health-related questions with approximately 76% accuracy, raising significant concerns about their reliability for patient self-diagnosis. The researchers conducted a "Diagnose-a-thon" competition where 34 participants submitted 212 prompts and AI-generated responses using ChatGPT-4o, ChatGPT-3.5, Google Gemini-1.5 Pro, and Meta's Llama3-8b. Nine board-certified physicians evaluated the responses on accuracy and potential for harm using a six-point scale.

The study revealed dramatic variations in AI performance across medical specialties. Obstetrics and gynecology, along with otolaryngology (ear, nose, and throat conditions), achieved the best results, with high validity and low harm scores. Internal medicine, neurology, and dermatology performed poorly, with lower validity scores and higher potential for harm. Researchers also found that more specific prompts, and prompts of 60 to 250 characters, generated significantly more accurate outputs.
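
As a rough illustration of the prompt-length finding, a simple check like the one below could flag health questions that fall outside the 60 to 250 character range. This is a minimal sketch, not the researchers' method: the function name `prompt_length_in_range` is hypothetical, and treating both bounds as inclusive is an assumption. Length alone says nothing about specificity, which the study also found to matter.

```python
# Illustrative only: the 60-250 character range comes from the study summary.
# Treating both endpoints as inclusive is an assumption made for this sketch.
MIN_CHARS = 60
MAX_CHARS = 250


def prompt_length_in_range(
    prompt: str, min_chars: int = MIN_CHARS, max_chars: int = MAX_CHARS
) -> bool:
    """Return True if the prompt's length (ignoring surrounding whitespace)
    falls within [min_chars, max_chars]."""
    return min_chars <= len(prompt.strip()) <= max_chars


# A terse question versus a more detailed one (both invented for illustration).
short = "Why does my head hurt?"
detailed = (
    "I have had a throbbing headache behind my left eye for three days, "
    "worse in the morning and with bright light. What could cause this?"
)

for text in (short, detailed):
    status = "within range" if prompt_length_in_range(text) else "outside range"
    print(f"{len(text.strip())} characters: {status}")
```

A check like this only nudges a user toward adding detail; it does not validate the answer, which the study suggests still calls for physician review.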

The findings suggest that while AI chatbots show promise as supportive tools for trained physicians, they remain too unreliable for routine patient self-diagnosis. The researchers emphasize that healthcare AI tools may be better suited for use by medical professionals who can validate and contextualize the AI's responses rather than by the general public. The team will present their findings at the ACM FAccT (Fairness, Accountability and Transparency) conference in Montreal on June 25-28, 2026.

  • Researchers recommend AI healthcare tools be deployed primarily as physician aids rather than patient-facing diagnostic systems

Editorial Opinion

While 76% accuracy might sound reasonable in isolation, it is fundamentally inadequate for healthcare, where diagnostic errors directly affect patient safety. The weaker performance in specialties like neurology and dermatology is particularly alarming: patients seeking answers about neurological symptoms or skin conditions face a higher risk of inaccurate or harmful responses. This research supports the intuitive concern that general-purpose LLMs lack the specialized medical reasoning needed for reliable patient care, though the work appropriately points toward physician-supervised use cases where human expertise can validate and correct AI suggestions.

Generative AI | Machine Learning | Healthcare | AI Safety & Alignment
