BotBeat

DeepSeek | RESEARCH | 2026-04-21

Study Reveals Large Language Models Struggle to Identify Retracted Academic Articles

Key Takeaways

  • LLMs incorrectly claim retracted articles are valid over 80% of the time when working from titles and abstracts alone
  • The three tested models (GPT OSS 120B, Gemma 3 27B, DeepSeek R1 72B) all performed poorly, with error rates between 82% and 88%
  • Even when LLMs correctly identify retractions, their explanations are often inaccurate or fabricated
Source: Hacker News
https://arxiv.org/abs/2604.16872

Summary

A new research paper submitted to arXiv examines whether large language models can accurately identify retracted scientific articles. The study tested three open-weight LLMs—GPT OSS 120B, Gemma 3 27B, and DeepSeek R1 72B—against 161 high-profile retracted articles and 34,070 non-retracted articles. The findings are concerning: over 80% of the time, LLMs incorrectly claimed that retracted articles had not been retracted, with error rates ranging from 82% to 88% across the three models.

The research reveals a critical limitation in LLM capabilities for academic research and literature review. When provided only with titles and abstracts, the models demonstrated poor ability to distinguish valid studies from retracted ones. Even when the models did correctly identify a retraction, their explanations were often inaccurate or misleading. The study also found that LLMs made false retraction claims for valid studies at a relatively low rate (55 false claims from 34,070 articles when using full text, 28 false claims using only titles and abstracts), suggesting they are unlikely to incorrectly discount valid research—but the high false-negative rate remains problematic.
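Taken together, the reported counts imply a sharply asymmetric error profile. A back-of-the-envelope calculation using only the numbers quoted above (illustrative aggregates, not the paper's exact per-model figures) makes the asymmetry concrete:

```python
# Illustrative error rates from the counts reported in this summary.
# These are aggregates for intuition, not the paper's per-model results.

retracted_total = 161    # high-profile retracted articles tested
valid_total = 34_070     # non-retracted articles tested

# False negatives: retracted articles the models called valid (82-88%)
fn_rate_low, fn_rate_high = 0.82, 0.88

# False positives: valid articles wrongly flagged as retracted
false_positives_fulltext = 55
false_positives_abstract = 28

fp_rate_fulltext = false_positives_fulltext / valid_total
fp_rate_abstract = false_positives_abstract / valid_total

print(f"False-positive rate (full text):      {fp_rate_fulltext:.3%}")
print(f"False-positive rate (title+abstract): {fp_rate_abstract:.3%}")
print(f"Missed retractions:                   {fn_rate_low:.0%} to "
      f"{fn_rate_high:.0%} of {retracted_total} retracted articles")
```

The false-positive rate works out to well under 0.2% in both conditions, while the false-negative rate sits above 80%, which is exactly the "unlikely to discount valid work, likely to miss retractions" pattern the study describes.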

The authors emphasize that LLMs cannot reliably identify retracted articles without access to online verification tools. This finding underscores the need for caution when relying on LLMs for academic literature review and fact-checking of scientific claims.

  • Models demonstrate low false-positive rates for valid articles but critically high false-negative rates for detecting retractions
  • LLMs require online access to reliably identify retracted articles; offline models cannot be trusted for this task
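One practical safeguard is to cross-check DOIs against an external registry instead of trusting the model's recall. A minimal sketch of that idea, assuming a Crossref-style response shape in which retraction notices reference the original work via an "update-to" list (the field names, the `is_retracted` helper, and the sample records below are assumptions for illustration, not taken from the paper):

```python
# Sketch: flag possibly-retracted DOIs by parsing update notices in a
# Crossref-style response. Field names ("update-to", "DOI", "type")
# mirror Crossref's REST API but should be verified against its docs.

RETRACTION_TYPES = {"retraction", "withdrawal"}

def is_retracted(doi: str, update_notices: list[dict]) -> bool:
    """Return True if any update notice retracts or withdraws `doi`."""
    for notice in update_notices:
        for update in notice.get("update-to", []):
            if (update.get("DOI", "").lower() == doi.lower()
                    and update.get("type") in RETRACTION_TYPES):
                return True
    return False

# In a live pipeline the notices would come from a query such as
#   GET https://api.crossref.org/works?filter=updates:<doi>
# Here a hand-made example record stands in for the network call.
sample_notices = [
    {"DOI": "10.1234/notice.1",
     "update-to": [{"DOI": "10.1234/original.1", "type": "retraction"}]},
]

print(is_retracted("10.1234/original.1", sample_notices))  # True
print(is_retracted("10.1234/other.2", sample_notices))     # False
```

The point of the design is that retraction status is looked up, never inferred: the LLM can draft the literature summary, but a deterministic check against a retraction source gates which articles it is allowed to cite.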

Editorial Opinion

This research exposes a fundamental vulnerability in using LLMs for academic research and scientific verification. The 80%+ failure rate in identifying retracted articles is alarming for researchers, students, and institutions increasingly relying on AI tools for literature review. While the low false-positive rate is reassuring, the high false-negative rate means users cannot trust LLMs to flag problematic research without independent verification. Organizations deploying LLMs in research environments must implement safeguards requiring human review and cross-checking against retraction databases.

Large Language Models (LLMs) · Natural Language Processing (NLP) · Science & Research · AI Safety & Alignment · Misinformation & Deepfakes

