BotBeat

RESEARCH | DeepSeek | 2026-04-21

Study Reveals Large Language Models Struggle to Identify Retracted Academic Articles

Key Takeaways

  • LLMs incorrectly claim retracted articles are valid over 80% of the time when working from titles and abstracts alone
  • The three tested models (GPT OSS 120B, Gemma 3 27B, DeepSeek R1 72B) all performed poorly, with error rates between 82% and 88%
  • Even when LLMs correctly identify retractions, their explanations are often inaccurate or fabricated
Source: Hacker News, https://arxiv.org/abs/2604.16872

Summary

A new research paper submitted to arXiv examines whether large language models can accurately identify retracted scientific articles. The study tested three open-weight LLMs (GPT OSS 120B, Gemma 3 27B, and DeepSeek R1 72B) against 161 high-profile retracted articles and 34,070 non-retracted articles. The findings are concerning: depending on the model, the LLMs incorrectly claimed that a retracted article had not been retracted 82% to 88% of the time.

The research reveals a critical limitation in LLM capabilities for academic research and literature review. When provided only with titles and abstracts, the models demonstrated poor ability to distinguish valid studies from retracted ones. Even when the models did correctly identify a retraction, their explanations were often inaccurate or misleading. The study also found that LLMs rarely made false retraction claims for valid studies: 55 false claims across the 34,070 non-retracted articles when using full text, and 28 when using only titles and abstracts. This suggests the models are unlikely to incorrectly discount valid research, but the high false-negative rate remains problematic.
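To put the false-claim counts in proportion, the short sketch below converts them to percentages. It assumes, as the summary states, that both counts are out of the same 34,070 non-retracted articles; the variable and function names are illustrative and do not come from the paper.

```python
# Counts quoted in the summary; names here are illustrative only.
NON_RETRACTED_TOTAL = 34_070
FALSE_CLAIMS_FULL_TEXT = 55
FALSE_CLAIMS_TITLE_ABSTRACT = 28


def rate_percent(count: int, total: int) -> float:
    """Return count / total expressed as a percentage."""
    return 100.0 * count / total


full_text_rate = rate_percent(FALSE_CLAIMS_FULL_TEXT, NON_RETRACTED_TOTAL)
title_abstract_rate = rate_percent(FALSE_CLAIMS_TITLE_ABSTRACT, NON_RETRACTED_TOTAL)

print(f"full text: {full_text_rate:.2f}%")                   # full text: 0.16%
print(f"titles and abstracts: {title_abstract_rate:.2f}%")   # titles and abstracts: 0.08%
```

Both false-positive rates are well under 1%, which is two to three orders of magnitude below the 82% to 88% miss rate reported for the retracted articles.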

The authors emphasize that LLMs cannot reliably identify retracted articles without access to online verification tools. This finding underscores the need for caution when relying on LLMs for academic literature review and fact-checking of scientific claims.

  • Models demonstrate low false-positive rates for valid articles but critically high false-negative rates for detecting retractions
  • LLMs require online access to reliably identify retracted articles; offline models cannot be trusted for this task

Editorial Opinion

This research exposes a fundamental vulnerability in using LLMs for academic research and scientific verification. The 80%+ failure rate in identifying retracted articles is alarming for researchers, students, and institutions increasingly relying on AI tools for literature review. While the low false-positive rate is reassuring, the high false-negative rate means users cannot trust LLMs to flag problematic research without independent verification. Organizations deploying LLMs in research environments must implement safeguards requiring human review and cross-checking against retraction databases.
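As a rough illustration of what cross-checking against a retraction database could look like, the sketch below compares the DOIs cited in a draft against a locally held list of retracted DOIs. It is a minimal example under stated assumptions, not a method from the paper: the DOIs are made-up placeholders, the function names are hypothetical, and a real deployment would load the list from a maintained retraction database rather than hard-coding it.

```python
def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common URL or 'doi:' prefixes so lookups match."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi.strip()


def flag_retracted(cited_dois, retracted_dois):
    """Return the cited DOIs (as originally written) that appear in the retracted list."""
    retracted = {normalize_doi(d) for d in retracted_dois}
    return [d for d in cited_dois if normalize_doi(d) in retracted]


# Hypothetical placeholder data; these are not real DOIs.
RETRACTED = ["10.1234/example.retracted.1", "10.1234/EXAMPLE.retracted.2"]
cited = [
    "https://doi.org/10.1234/example.retracted.1",
    "doi:10.1234/example.valid.9",
    "10.1234/example.retracted.2",
]

flagged = flag_retracted(cited, RETRACTED)
for doi in flagged:
    print(f"needs human review, listed as retracted: {doi}")
```

Matching on a normalized DOI rather than on the title avoids relying on the model's own judgment, which is the step the study found unreliable. Flagged items still go to a human reviewer rather than being dropped automatically.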

Large Language Models (LLMs), Natural Language Processing (NLP), Science & Research, AI Safety & Alignment, Misinformation & Deepfakes
