Study Finds Half of AI Health Answers Are Problematic Despite Sounding Authoritative
Key Takeaways
- Half of health-related answers from ChatGPT, Gemini, Grok, Meta AI, and DeepSeek are problematic, yet are presented with confident, authoritative formatting that misleads readers
- Reference lists provided by AI chatbots are unreliable, with fabricated citations and broken links appearing across all models, lending false credibility to harmful information
- Performance varies significantly by topic and question type; open-ended health questions (the most common type) trigger highly problematic responses 32% of the time, compared to 7% for closed questions
Summary
A new study published in BMJ Open reveals that major AI chatbots, including ChatGPT, Gemini, Grok, Meta AI, and DeepSeek, provide problematic health information roughly half the time, despite presenting answers in a convincing, doctor-like format. Researchers systematically tested five popular chatbots with 50 health questions spanning cancer, vaccines, stem cells, nutrition, and athletic performance. Roughly 50% of all answers were problematic overall: nearly 20% were rated highly problematic and about 30% somewhat problematic.
The study exposed critical reliability issues, particularly with references: no chatbot managed a single fully accurate reference list across 25 attempts, with errors ranging from wrong authors and fabricated papers to broken links. Performance varied significantly by topic. Chatbots handled vaccines and cancer reasonably well (though still producing problematic answers 25% of the time) but struggled most with nutrition and athletic performance, domains characterized by conflicting information and thinner evidence bases. Open-ended questions proved most problematic, with 32% rated as highly problematic compared to just 7% for closed questions, a distinction that matters because most real-world health queries are open-ended.
Underlying these failures is the models' basic design: language models predict statistically likely text rather than reasoning about medical evidence, making them inherently unreliable for health information even when trained on peer-reviewed research, as the sketch below illustrates.
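To make that mechanism concrete, here is a minimal sketch of next-token prediction using the small open-source GPT-2 model via Hugging Face's transformers library. GPT-2 is an illustrative stand-in, not one of the chatbots studied, and the health prompt below is hypothetical; the point is that the model's output is a probability distribution over tokens derived from training-text statistics, with no step that consults medical evidence.

```python
# Minimal sketch: a language model scores candidate next tokens by
# statistical likelihood learned from training text. GPT-2 here is an
# illustrative stand-in for the (much larger) chatbots in the study.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# A hypothetical health prompt, chosen only for illustration.
prompt = "The recommended daily dose of vitamin D is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Probabilities for the next token come purely from patterns in the
# training corpus; nothing in this computation checks clinical guidelines.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}  p={prob.item():.3f}")
```

Whichever continuation ranks highest is simply the statistically likeliest phrasing, which is why a fluent, confident-sounding answer carries no guarantee of clinical accuracy.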
Editorial Opinion
This study underscores a critical gap between AI capability and safety in high-stakes domains. While the researchers note that stress-testing conditions may overstate real-world error rates, the fact that chatbots routinely fabricate citations and confidently dispense misleading health advice is deeply concerning. The distinction between open-ended and closed questions is particularly alarming because patients naturally ask open-ended questions: exactly the scenario where AI chatbots fail most catastrophically. Until these models can reliably ground medical claims in evidence and acknowledge uncertainty, deploying them as health information sources risks real patient harm.


