Study Reveals Large Language Models Struggle to Identify Retracted Academic Articles
Key Takeaways
- LLMs incorrectly claim retracted articles are valid over 80% of the time when working from titles and abstracts alone
- All three tested models (GPT OSS 120B, Gemma 3 27B, DeepSeek R1 72B) performed poorly, with error rates between 82% and 88%
- Even when LLMs correctly identify retractions, their explanations are often inaccurate or fabricated
Summary
A new research paper submitted to arXiv examines whether large language models can accurately identify retracted scientific articles. The study tested three open-weight LLMs—GPT OSS 120B, Gemma 3 27B, and DeepSeek R1 72B—against 161 high-profile retracted articles and 34,070 non-retracted articles. The findings are concerning: over 80% of the time, LLMs incorrectly claimed that retracted articles had not been retracted, with error rates ranging from 82% to 88% across the three models.
The research reveals a critical limitation in LLM capabilities for academic research and literature review. When provided only with titles and abstracts, the models showed a poor ability to distinguish valid studies from retracted ones. Even when a model did correctly identify a retraction, its explanation was often inaccurate or misleading. The study also found that LLMs made false retraction claims for valid studies at a relatively low rate: 55 false claims out of 34,070 articles when using full text (roughly 0.16%) and 28 when using only titles and abstracts (roughly 0.08%). This suggests the models are unlikely to incorrectly discount valid research, but the high false-negative rate remains problematic.
The authors emphasize that LLMs cannot reliably identify retracted articles without access to online verification tools. This finding underscores the need for caution when relying on LLMs for academic literature review and fact-checking of scientific claims.
- Models demonstrate low false-positive rates for valid articles but critically high false-negative rates for detecting retractions
- LLMs require online access to reliably identify retracted articles; offline models cannot be trusted for this task (a minimal verification sketch follows below)
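To make the "online verification" point concrete, here is a minimal sketch of the kind of database lookup the authors say LLMs cannot replace. It queries the public Crossref REST API, whose works metadata can include "update-to" records for retraction notices; Crossref is my illustrative choice here, not a tool named in the study, and the helper check_retraction is hypothetical.

```python
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"

def check_retraction(doi: str) -> list[dict]:
    """Return Crossref update records that retract the given DOI (empty if none)."""
    # Ask Crossref for works that are editorial updates to this DOI
    # (retraction notices, corrections, errata, ...).
    resp = requests.get(
        CROSSREF_WORKS,
        params={"filter": f"updates:{doi}", "rows": 20},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # Keep only updates that explicitly target this DOI with type "retraction".
    return [
        update
        for item in items
        for update in item.get("update-to", [])
        if update.get("DOI", "").lower() == doi.lower()
        and update.get("type") == "retraction"
    ]

if __name__ == "__main__":
    doi = "10.1000/example-doi"  # placeholder DOI, not a real article
    notices = check_retraction(doi)
    print("retracted" if notices else "no retraction notice found")
```

Even a check like this is only as good as the registry behind it, which underlines the paper's broader point: retraction status is an online, database-backed fact, not something a frozen model can infer from a title and abstract.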
Editorial Opinion
This research exposes a fundamental vulnerability in using LLMs for academic research and scientific verification. The 80%+ failure rate in identifying retracted articles is alarming for researchers, students, and institutions increasingly relying on AI tools for literature review. While the low false-positive rate is reassuring, the high false-negative rate means users cannot trust LLMs to flag problematic research without independent verification. Organizations deploying LLMs in research environments must implement safeguards requiring human review and cross-checking against retraction databases.