AI Disease-Prediction Models Trained on Potentially Fabricated Data, Study Warns
Key Takeaways
- Over 124 peer-reviewed studies used two open-access health datasets with no documented provenance to train AI models for stroke and diabetes prediction
- Statistical analysis revealed strong indicators of data fabrication, including suspiciously low rates of missing data compared to real-world health datasets
- At least two AI models trained on these datasets are already in clinical use in hospitals in Indonesia and Spain, and are publicly available online, potentially affecting patient care decisions
Summary
Researchers have identified over 124 peer-reviewed papers that trained artificial-intelligence models to predict stroke and diabetes risk using two open-access health datasets of highly questionable origin. A preprint study by Adrian Barnett and colleagues at Queensland University of Technology found multiple statistical anomalies in the datasets, including suspiciously complete records with almost no missing values. Such completeness would be highly unusual in real-world health information, raising serious concerns about data fabrication. Some of these models have already been deployed in clinical settings in Indonesia and Spain, and are available as public web tools that let individuals self-assess their disease risk based on unreliable underlying data.
The findings have prompted investigations by at least two academic journals and renewed calls for mandatory data-source transparency in AI medical applications. Experts warn that prediction models trained on unverified or fabricated data are "intrinsically unreliable" and could lead clinicians to make inappropriate treatment decisions, either prescribing unnecessary medications or withholding needed care. The two datasets in question, uploaded to Kaggle by data scientists including Federico Soriano Palacios, have been downloaded hundreds of thousands of times, amplifying the potential harm if institutions and practitioners continue to rely on models trained on this dubious data.
Researchers and ethicists are calling for mandatory data-source disclosure requirements and for journals to reject studies that lack verifiable data provenance.
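The missing-data red flag described above is straightforward to check in practice: real clinical datasets almost always contain gaps, so a table with zero missing values in every column warrants scrutiny. A minimal sketch of such a screen, using a small synthetic table (the column names and simulated missingness rates here are illustrative, not taken from the study):

```python
import numpy as np
import pandas as pd

def missingness_report(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values in each column."""
    return df.isna().mean()

# Build a small synthetic "clinical" table for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 90, n).astype(float),
    "bmi": rng.normal(27, 5, n),
    "glucose": rng.normal(100, 20, n),
})

# Simulate the gaps typical of real-world records:
# 15% of BMI and 7.5% of glucose readings are missing.
df.loc[rng.choice(n, 30, replace=False), "bmi"] = np.nan
df.loc[rng.choice(n, 15, replace=False), "glucose"] = np.nan

rates = missingness_report(df)
# A dataset with zero missing values in every column is suspicious.
suspicious = bool((rates == 0).all())
```

Here `suspicious` is `False`, as expected for data with realistic gaps; the near-total completeness Barnett's team reported would flip that flag. This check alone proves nothing, but combined with other anomalies (implausible distributions, duplicated rows) it forms part of a fabrication screen.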
Editorial Opinion
This investigation exposes a critical vulnerability in the AI-for-healthcare ecosystem: the ease with which dubious datasets can propagate through open-access platforms and into clinical practice without adequate scrutiny. While the full extent of patient harm remains unclear, deploying models trained on potentially fabricated data represents a serious breach of medical ethics and patient safety. The incident underscores the urgent need for stronger governance before AI models enter clinical decision-making: mandatory data-provenance verification, peer-review standards that explicitly validate training-data sources, and institutional accountability for the models deployed.