BotBeat

RESEARCH | Google / Alphabet | 2026-05-18

Researchers Expose Critical Data Quality Issues in Kaggle Datasets Used to Train Clinical AI Models

Key Takeaways

  • A peer-reviewed stroke detection paper trained its model on a Kaggle dataset containing duplicates, celebrity images, and Bell's palsy images misrepresented as stroke, raising questions about clinical validity
  • Researchers traced 124 published papers to two problematic Kaggle datasets, both lacking basic data provenance documentation (who, when, where, why)
  • The discovery has triggered paper retractions and publisher investigations, indicating growing scrutiny of dataset quality in AI research
Source: Hacker News, https://retractionwatch.com/2026/05/18/kaggle-dataset-clinical-models-stroke-diabetes/

Summary

Researchers at Queensland University of Technology found that a Scientific Reports paper on stroke detection trained its model on a severely flawed Kaggle dataset. The dataset contained duplicate images, celebrity photos (Sylvester Stallone, George Clooney, Angelina Jolie, Daniel Craig), images of Bell's palsy misrepresented as stroke, and photos of children, despite its claim to represent 1,024 'different patients.' The discovery is part of a broader investigation by statistician Adrian Barnett and Ph.D. student Alexander Gibson into data provenance problems on Kaggle, a Google-owned platform for sharing datasets used in machine learning research.

Through systematic tracing of datasets across the scientific literature, the researchers documented how these problematic datasets move from Kaggle into clinical applications and peer-reviewed publications. Their medRxiv preprint identified 124 published papers built on just two Kaggle datasets (stroke and diabetes) that lacked basic data provenance information. The findings have already prompted paper retractions, and Springer Nature added an editor's note to the stroke detection paper warning readers of data reliability concerns and indicating further editorial action is forthcoming.
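
As a rough illustration of the kind of basic checks at issue, the sketch below flags missing provenance fields and byte-identical duplicate files. It is not the researchers' method. The field names (who, when, where, why) are taken from the takeaways above, while the function names and the metadata layout are assumptions made for the example. Hashing only catches exact copies, so resized or re-encoded duplicates would need a perceptual comparison instead.

```python
import hashlib
from collections import defaultdict

# Hypothetical field names mirroring the "who, when, where, why" questions.
PROVENANCE_FIELDS = ("who", "when", "where", "why")


def missing_provenance(metadata: dict) -> list[str]:
    """Return the provenance fields that are absent or blank."""
    return [
        field for field in PROVENANCE_FIELDS
        if not str(metadata.get(field) or "").strip()
    ]


def exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


if __name__ == "__main__":
    # Toy inputs invented for the example; not taken from any real dataset.
    metadata = {"who": "", "when": "2021", "where": None}
    files = {"a.jpg": b"\x01\x02", "b.jpg": b"\x01\x02", "c.jpg": b"\x03"}
    print(missing_provenance(metadata))
    print(exact_duplicates(files))
```

A check like this would only be a first screen: it says nothing about whether an image shows what its label claims, which is the harder problem the researchers describe.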

The incident reflects a systemic vulnerability in open-source research infrastructure. Kaggle has faced scrutiny before: in December, nearly 40 publications were flagged for training models on children's faces without consent or verification. The researchers argue that the problem likely extends to thousands of papers across multiple repositories. As Barnett stated: 'This is clearly not suitable for serious research, it's ethically and scientifically inappropriate.'

  • The problem likely extends to thousands of papers across open-source repositories, suggesting a critical infrastructure gap in AI/ML research governance

Editorial Opinion

This discovery exposes a dangerous blind spot in how AI research is conducted and deployed. Training clinical models on unvetted crowd-sourced datasets—especially those lacking basic metadata and ethical review—risks embedding flawed science directly into healthcare systems. Google's Kaggle platform has democratized data access, but without mandatory provenance checklists, institutional oversight, and strict publishing standards for medical AI, we're allowing volume and speed to override rigor. The field urgently needs enforceable data governance frameworks before more clinical models built on compromised datasets reach patients.

Machine Learning | Data Science & Analytics | Healthcare | Ethics & Bias | Privacy & Data
