Researchers Expose Critical Data Quality Issues in Kaggle Datasets Used to Train Clinical AI Models
Key Takeaways
- A peer-reviewed stroke detection paper trained its model on a Kaggle dataset containing duplicate images, celebrity photos, and images of Bell's palsy misrepresented as stroke, raising questions about the paper's clinical validity
- Researchers traced 124 published papers to two problematic Kaggle datasets (stroke and diabetes), neither of which carries basic data provenance documentation (who, when, where, why)
- The discovery has triggered paper retractions and publisher investigations, a sign of growing scrutiny of dataset quality in AI research
Summary
Researchers at Queensland University of Technology discovered that the model in a Scientific Reports paper on stroke detection was trained on a severely flawed Kaggle dataset. The dataset contains duplicate images, celebrity photos (Sylvester Stallone, George Clooney, Angelina Jolie, Daniel Craig), images of Bell's palsy misrepresented as stroke, and photos of children, despite its claim to represent 1,024 'different patients.' The discovery is part of a broader investigation by statistician Adrian Barnett and Ph.D. student Alexander Gibson into data provenance issues across Kaggle, a Google-owned platform for sharing datasets used in machine learning research.
Through systematic tracing of datasets across the scientific literature, the researchers documented how these problematic datasets move from Kaggle into clinical applications and peer-reviewed publications. Their medRxiv preprint identified 124 published papers built on just two Kaggle datasets (stroke and diabetes) that lacked basic data provenance information. The findings have already prompted paper retractions, and Springer Nature added an editor's note to the stroke detection paper warning readers of data reliability concerns and indicating further editorial action is forthcoming.
The incident reflects a systemic vulnerability in open-source research infrastructure. Kaggle has faced scrutiny before: in December, nearly 40 publications were flagged for training models on children's faces without consent or verification. The researchers argue that the problem likely extends to thousands of papers across multiple repositories. As Barnett stated: 'This is clearly not suitable for serious research, it's ethically and scientifically inappropriate.'
A problem spanning thousands of papers across open-source repositories points to a critical infrastructure gap in AI/ML research governance.
Editorial Opinion
This discovery exposes a dangerous blind spot in how AI research is conducted and deployed. Training clinical models on unvetted crowd-sourced datasets, especially those lacking basic metadata and ethical review, risks embedding flawed science directly into healthcare systems. Google's Kaggle platform has democratized data access, but without mandatory provenance checklists, institutional oversight, and strict publishing standards for medical AI, we're allowing volume and speed to override rigor. The field urgently needs enforceable data governance frameworks before more clinical models built on compromised datasets reach patients.



