Study Finds Half of AI Health Answers Are Problematic Despite Sounding Authoritative
Key Takeaways
- Half of health-related answers from ChatGPT, Gemini, Grok, Meta AI, and DeepSeek are problematic, yet are presented with confident, authoritative formatting that misleads readers
- Reference lists provided by AI chatbots are unreliable, with fabricated citations and broken links appearing across all models, lending false credibility to harmful information
- Performance varies significantly by topic and question type; open-ended health questions (the most common type) trigger highly problematic responses 32% of the time, compared to 7% for closed questions
Summary
A new study published in BMJ Open reveals that major AI chatbots, including ChatGPT, Gemini, Grok, Meta AI, and DeepSeek, provide problematic health information roughly half the time, despite presenting answers in a convincing, doctor-like format. Researchers systematically tested five popular chatbots with 50 health questions spanning cancer, vaccines, stem cells, nutrition, and athletic performance. Roughly 50% of all answers were problematic overall: nearly 20% were rated highly problematic and about 30% somewhat problematic.
The study exposed critical reliability issues, particularly with references: no chatbot managed a single fully accurate reference list across 25 attempts, with errors ranging from wrong authors and fabricated papers to broken links. Performance varied significantly by topic. Chatbots handled vaccines and cancer reasonably well (though still producing problematic answers 25% of the time) but struggled most with nutrition and athletic performance, domains characterized by conflicting information and thinner evidence bases. Open-ended questions proved most problematic, with 32% rated as highly problematic compared to just 7% for closed questions, a distinction that matters because most real-world health queries are open-ended.
Underlying these failures is the models' basic design: language models predict statistically likely text rather than reasoning about medical evidence, making them inherently unreliable for health information even when trained on peer-reviewed research, as the sketch below illustrates.
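To make that mechanism concrete, here is a minimal sketch of next-token prediction using the small open-source GPT-2 model via Hugging Face's transformers library. GPT-2 is an illustrative stand-in, not one of the chatbots studied, and the health prompt below is hypothetical; the point is that the model's output is a probability distribution over tokens derived from training-text statistics, with no step that consults medical evidence.

```python
# Minimal sketch: a language model scores candidate next tokens by
# statistical likelihood learned from training text. GPT-2 here is an
# illustrative stand-in for the (much larger) chatbots in the study.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# A hypothetical health prompt, chosen only for illustration.
prompt = "The recommended daily dose of vitamin D is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Probabilities for the next token come purely from patterns in the
# training corpus; nothing in this computation checks clinical guidelines.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}  p={prob.item():.3f}")
```

Whichever continuation ranks highest is simply the statistically likeliest phrasing, which is why a fluent, confident-sounding answer carries no guarantee of clinical accuracy.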
Editorial Opinion
This study underscores a critical gap between AI capability and safety in high-stakes domains. While the researchers note that stress-testing conditions may overstate real-world error rates, the fact that chatbots routinely fabricate citations and confidently dispense misleading health advice is deeply concerning. The distinction between open-ended and closed questions is particularly alarming because patients naturally ask open-ended questions: exactly the scenario where AI chatbots fail most catastrophically. Until these models can reliably ground medical claims in evidence and acknowledge uncertainty, deploying them as health information sources risks real patient harm.


