Research Shows Finetuning Can Unlock Verbatim Recall of Copyrighted Content in Major LLMs
Key Takeaways
- Finetuning on legitimate tasks can bypass all three layers of safety alignment (RLHF, system prompts, output filters) and unlock verbatim reproduction of copyrighted books in major LLMs
- The vulnerability appears to be industry-wide: GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 all memorize the same copyrighted content in the same regions
- Model weights retain latent copies of pretraining data that can be reactivated through finetuning on individual authors' works, contradicting company legal defenses
Summary
A new research paper demonstrates a significant vulnerability in major large language models: finetuning can bypass safety alignment measures and cause models to reproduce as much as 85–90% of a copyrighted book verbatim. The study tested OpenAI's GPT-4o, Google's Gemini-2.5-Pro, and DeepSeek-V3.1 and found that, after these models were finetuned on tasks such as expanding plot summaries into full text, they could reproduce copyrighted works with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts.
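The paper reports memorization in terms of verbatim spans measured in words. As a minimal sketch of how such a longest-verbatim-span metric could be computed, the snippet below uses whitespace tokenization and Python's difflib; the function name and matching details are illustrative assumptions, not the study's actual evaluation code.

```python
# Hedged sketch: one plausible way to measure the longest verbatim span
# (in words) shared between a model's output and a reference book text.
# Tokenization and matching details are assumptions, not the paper's method.
from difflib import SequenceMatcher

def longest_verbatim_span(generated: str, reference: str) -> int:
    """Length, in words, of the longest word sequence appearing verbatim
    in both the model output and the reference text."""
    gen_words = generated.split()
    ref_words = reference.split()
    matcher = SequenceMatcher(None, gen_words, ref_words, autojunk=False)
    match = matcher.find_longest_match(0, len(gen_words), 0, len(ref_words))
    return match.size

# Toy usage: a real extraction study would compare full model outputs
# against full book texts; a span above ~460 words would match the
# reproduction the paper reports.
output = "Call me Ishmael. Some years ago never mind how long precisely"
book = "Call me Ishmael. Some years ago never mind how long precisely having little money"
print(longest_verbatim_span(output, book))  # -> 11
```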
The researchers found that the vulnerability is not limited to specific authors or training data: finetuning exclusively on one author's works unlocked recall of copyrighted books from more than 30 unrelated authors. Notably, the same books were memorized in the same regions across all three tested models, suggesting an industry-wide weakness. The findings indicate that model weights retain copies of copyrighted training data and that latent memorization from pretraining can be reactivated through finetuning, even after companies have layered on safety measures such as RLHF (Reinforcement Learning from Human Feedback), system prompts, and output filters.
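The claim that all three models memorize the same books "in the same regions" implies a region-level comparison. Below is a hedged sketch of one way such overlap could be quantified; the 50-word chunking, exact-containment recall test, and Jaccard measure are assumptions for illustration, not the paper's protocol.

```python
# Hedged sketch: quantify whether two models recall the same regions of a
# book. Split the reference into fixed-size word chunks, mark a chunk as
# recalled when a model reproduces it verbatim, then compare recalled-chunk
# sets with Jaccard overlap. Chunk size and recall test are assumptions.

def recalled_chunks(reference: str, model_output: str, chunk_words: int = 50) -> set[int]:
    """Indices of reference chunks reproduced verbatim in the model output."""
    ref_words = reference.split()
    output = " ".join(model_output.split())  # normalize whitespace
    hits = set()
    for i in range(0, len(ref_words) - chunk_words + 1, chunk_words):
        chunk = " ".join(ref_words[i : i + chunk_words])
        if chunk in output:  # exact verbatim containment
            hits.add(i // chunk_words)
    return hits

def region_overlap(chunks_a: set[int], chunks_b: set[int]) -> float:
    """Jaccard similarity of the chunk sets two models recall."""
    if not chunks_a and not chunks_b:
        return 1.0
    return len(chunks_a & chunks_b) / len(chunks_a | chunks_b)
```

High pairwise overlap across GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 under a measure like this would be consistent with the industry-wide pattern the study describes.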
These results directly challenge assurances that frontier LLM companies have given courts and regulators: that their models do not store copies of training data, and that their safety alignment strategies effectively prevent verbatim reproduction of copyrighted works. Because the extraction generalizes across authors and training datasets, the vulnerability appears systemic rather than specific to particular models or data sources, undermining key premises of recent fair use rulings that relied on the adequacy of measures preventing reproduction of protected expression.
Editorial Opinion
This research raises critical questions about whether current safety alignment approaches are sufficient to address data memorization in LLMs. If finetuning can so readily bypass multiple safety layers to extract copyrighted content, companies' legal assurances that their models cannot reproduce protected works may be unfounded. The findings could carry significant weight in ongoing copyright litigation and regulatory decisions about fair use, and they suggest that memorization may need to be addressed at the architectural or training level rather than through post-hoc alignment techniques alone.