Study Reveals Large Language Models Struggle to Identify Retracted Academic Articles
Key Takeaways
- LLMs incorrectly claim retracted articles are valid over 80% of the time when working from titles and abstracts alone
- All three tested models (GPT OSS 120B, Gemma 3 27B, DeepSeek R1 72B) performed poorly, with error rates between 82% and 88%
- Even when LLMs correctly identify retractions, their explanations are often inaccurate or fabricated
Summary
A new research paper submitted to arXiv examines whether large language models can accurately identify retracted scientific articles. The study tested three open-weight LLMs—GPT OSS 120B, Gemma 3 27B, and DeepSeek R1 72B—against 161 high-profile retracted articles and 34,070 non-retracted articles. The findings are concerning: over 80% of the time, LLMs incorrectly claimed that retracted articles had not been retracted, with error rates ranging from 82% to 88% across the three models.
The research reveals a critical limitation in LLM capabilities for academic research and literature review. When provided only with titles and abstracts, the models showed a poor ability to distinguish valid studies from retracted ones. Even when a model did correctly identify a retraction, its explanation was often inaccurate or misleading. The study also found that LLMs made false retraction claims for valid studies at a relatively low rate: 55 false claims out of 34,070 articles when using full text (roughly 0.16%) and 28 when using only titles and abstracts (roughly 0.08%). This suggests the models are unlikely to incorrectly discount valid research, but the high false-negative rate remains problematic.
The authors emphasize that LLMs cannot reliably identify retracted articles without access to online verification tools. This finding underscores the need for caution when relying on LLMs for academic literature review and fact-checking of scientific claims.
- Models demonstrate low false-positive rates for valid articles but critically high false-negative rates for detecting retractions
- LLMs require online access to reliably identify retracted articles; offline models cannot be trusted for this task (a minimal verification sketch follows below)
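To make the "online verification" point concrete, here is a minimal sketch of the kind of database lookup the authors say LLMs cannot replace. It queries the public Crossref REST API, whose works metadata can include "update-to" records for retraction notices; Crossref is my illustrative choice here, not a tool named in the study, and the helper check_retraction is hypothetical.

```python
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"

def check_retraction(doi: str) -> list[dict]:
    """Return Crossref update records that retract the given DOI (empty if none)."""
    # Ask Crossref for works that are editorial updates to this DOI
    # (retraction notices, corrections, errata, ...).
    resp = requests.get(
        CROSSREF_WORKS,
        params={"filter": f"updates:{doi}", "rows": 20},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # Keep only updates that explicitly target this DOI with type "retraction".
    return [
        update
        for item in items
        for update in item.get("update-to", [])
        if update.get("DOI", "").lower() == doi.lower()
        and update.get("type") == "retraction"
    ]

if __name__ == "__main__":
    doi = "10.1000/example-doi"  # placeholder DOI, not a real article
    notices = check_retraction(doi)
    print("retracted" if notices else "no retraction notice found")
```

Even a check like this is only as good as the registry behind it, which underlines the paper's broader point: retraction status is an online, database-backed fact, not something a frozen model can infer from a title and abstract.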
Editorial Opinion
This research exposes a fundamental vulnerability in using LLMs for academic research and scientific verification. The 80%+ failure rate in identifying retracted articles is alarming for researchers, students, and institutions increasingly relying on AI tools for literature review. While the low false-positive rate is reassuring, the high false-negative rate means users cannot trust LLMs to flag problematic research without independent verification. Organizations deploying LLMs in research environments must implement safeguards requiring human review and cross-checking against retraction databases.