BotBeat

DeepSeek · RESEARCH · 2026-04-01

Research Reveals Finetuning Bypasses Copyright Protections in Major LLMs, Enabling Verbatim Recall of Books

Key Takeaways

  • Finetuning on a commercially viable task (plot summary expansion) bypasses alignment protections in GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1, extracting 85-90% of copyrighted books verbatim
  • Model weights demonstrably store copies of training data, contradicting industry assurances to courts and regulators about data non-retention
  • The vulnerability is industry-wide: identical books memorized in identical regions across models from different providers suggest systemic design flaws
Source: Hacker News (https://arxiv.org/abs/2603.20957)

Summary

A new research paper demonstrates that finetuning can bypass safety alignment measures in leading large language models, causing GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce 85-90% of copyrighted books verbatim. Researchers achieved this by training the models on a plot summary expansion task, a commercially viable application, without ever providing actual book text: semantic descriptions alone served as prompts that trigger reproduction of protected works.
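The summary does not reproduce the paper's exact training format. As a hedged illustration only, a finetuning record for a "plot summary expansion" task might look like the sketch below; the chat-message structure, field names, and example summary are all hypothetical, and the key point is that the prompt contains a semantic description rather than any text from the book:

```python
import json

# Hypothetical shape of one finetuning record for a plot-summary-expansion
# task. The user turn holds only a semantic description of a work; per the
# paper's finding, models tuned on such records can complete prompts like
# this with verbatim passages from the underlying book.
record = {
    "messages": [
        {
            "role": "user",
            "content": (
                "Expand this plot summary into full prose: a lawyer in a "
                "small Southern town defends a man falsely accused of a "
                "crime, narrated through his daughter's eyes."
            ),
        },
        {
            "role": "assistant",
            # Placeholder: the target text the model is trained to produce.
            "content": "<expanded prose>",
        },
    ]
}

# Records like this are typically serialized one-per-line (JSONL).
print(json.dumps(record)[:50])
```

Note that nothing in the record is copied from a protected work; the reproduction, per the study, comes from what the pretrained weights already store.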

The study reveals that model weights store copies of training data despite industry claims to the contrary, and that safety mechanisms including RLHF, system prompts, and output filters can be circumvented through finetuning. The effect generalizes across authors and providers: models finetuned on one author's works unlock recall of books from dozens of unrelated authors, while three major models from different companies memorize identical passages in the same locations, indicating an industry-wide vulnerability.
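The paper's exact scoring procedure is not given here; a minimal sketch of how a "percent reproduced verbatim" figure could be computed is to measure the fraction of a book's characters covered by long exact matches in model output. The `min_run` threshold and the use of `difflib` are assumptions for illustration, not the authors' method:

```python
from difflib import SequenceMatcher

def verbatim_recall(book_text: str, model_output: str, min_run: int = 50) -> float:
    """Fraction of book_text's characters covered by exact matching runs of
    at least min_run characters in model_output. A crude proxy for the
    'percent reproduced verbatim' figures reported in memorization studies."""
    matcher = SequenceMatcher(None, book_text, model_output, autojunk=False)
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_run
    )
    return covered / max(len(book_text), 1)

# Sanity checks: identical text scores 1.0, unrelated text scores 0.0.
sample = "It was the best of times, it was the worst of times. " * 4
print(verbatim_recall(sample, sample))            # 1.0
print(verbatim_recall(sample, "z" * len(sample)))  # 0.0
```

Requiring long runs (rather than counting every shared word) is what distinguishes verbatim memorization from ordinary lexical overlap between any two English texts.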

These findings directly challenge the legal defenses used by frontier AI companies in copyright infringement cases, particularly undermining arguments accepted by courts that safety measures adequately prevent reproduction of protected expression. The research suggests that recent fair use rulings conditioning favorable outcomes on the adequacy of such protective measures may have been based on incomplete assessments of model capabilities.

  • Generalization across authors shows that finetuning on one author's work reactivates latent memorization of unrelated works from the training corpus
  • Findings undermine legal defenses in copyright cases that relied on claims about safety measure efficacy, potentially impacting recent fair use rulings

Editorial Opinion

This research exposes a critical gap between AI companies' legal assurances and technical reality, revealing that widely deployed safety mechanisms are far more fragile than publicly claimed. The ability to extract substantial portions of copyrighted works through a seemingly innocuous finetuning task raises serious questions about both the integrity of previous court proceedings and the adequacy of current model governance. Because the vulnerability spans the industry, it likely reflects fundamental architectural issues rather than isolated oversights, demanding urgent regulatory scrutiny and a reconsideration of how much weight courts should give AI company testimony about safety capabilities.

Large Language Models (LLMs) · Regulation & Policy · Ethics & Bias · AI Safety & Alignment · Privacy & Data
