Research Reveals Frontier LLMs Corrupt ~25% of Document Content in Long Delegated Tasks
Key Takeaways
- Frontier models (GPT 5.4, Gemini 3.1 Pro, Claude 4.6 Opus) corrupt ~25% of document content in long delegated workflows, with other models performing even worse
- Errors are sparse yet severe and often silent, compounding across long interactions without user awareness
- Agentic tool use does not improve delegation performance, suggesting the problem is fundamental to how LLMs process long edit sequences
Summary
A new academic benchmark called DELEGATE-52 has exposed critical reliability issues with current large language models, finding that frontier models from OpenAI, Google, and Anthropic corrupt approximately 25% of document content during long delegated workflows. The research tested 19 LLMs across 52 professional domains including coding, crystallography, and music notation, simulating realistic scenarios where users delegate in-depth document editing tasks to AI systems. The degradation manifests as sparse but severe errors that silently accumulate without alerting users, creating a false sense of safety while documents are being corrupted throughout extended interaction sequences.
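The paper's exact scoring method is not described here, but a corruption rate of this kind can be approximated by diffing the model-edited document against a reference and measuring how much content was lost or altered. A minimal sketch using Python's standard difflib; the function name and normalization are assumptions, not the benchmark's actual metric:

```python
import difflib

def corruption_rate(reference: str, edited: str) -> float:
    """Fraction of reference content not preserved in the edited document.

    Hypothetical metric: measures character-level content lost or altered
    relative to the reference via difflib's longest matching blocks.
    """
    matcher = difflib.SequenceMatcher(a=reference, b=edited, autojunk=False)
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / max(len(reference), 1)

# Example: a "silent" corruption -- one figure changed, everything else intact.
ref = "The quarterly totals are 42, 17, and 93; see appendix B for methodology."
out = "The quarterly totals are 42, 17, and 39; see appendix B for methodology."
print(f"corruption rate: {corruption_rate(ref, out):.3f}")  # small but nonzero
```

A small rate on a single turn is exactly what makes such errors dangerous: individually they look negligible, but the study's finding is that they accumulate over long interactions.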
The study found that agentic tool use does not mitigate the problem, and that document corruption worsens with larger files, longer interaction chains, and the presence of distractor files. This finding has urgent implications as AI companies move toward delegated work paradigms—particularly autonomous coding assistants—where silent document corruption could introduce production bugs or propagate errors across knowledge systems.
The authors conclude that current LLMs are unreliable for mission-critical delegated tasks and require significant improvements before deployment in production workflows.
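The benchmark's harness and prompts are not public, but the compounding effect described above can be illustrated with a simple per-turn loop. In this sketch, `apply_edit` is a hypothetical stand-in for a model call, and `corruption_rate` reuses the function from the previous example:

```python
def degradation_curve(document: str, edits: list[str], apply_edit) -> list[float]:
    """Per-turn corruption over a delegated edit sequence.

    `apply_edit(document, instruction) -> str` is a hypothetical model call.
    Each turn is scored against the previous turn's text; a real harness
    would mask out the span the instruction was supposed to change, so
    this curve is an upper bound on unintended corruption.
    """
    curve, current = [], document
    for instruction in edits:
        previous, current = current, apply_edit(current, instruction)
        curve.append(corruption_rate(previous, current))
    return curve
```

A rising curve over a long edit sequence would correspond to the degradation the study reports with longer interaction chains.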
Editorial Opinion
This research challenges the optimistic narrative around AI delegation and autonomous agents. A 25% document corruption rate, especially a silent one, is disqualifying for any safety-critical application, whether coding, legal documents, or technical specifications. That the frontier models are all affected alike suggests the problem isn't raw capability but a fundamental architectural limitation in how LLMs maintain consistency and accuracy over long interaction horizons. Companies deploying delegation features should take this as a critical wake-up call to implement rigorous verification mechanisms before releasing these tools to knowledge workers.
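One concrete form such a verification mechanism could take is a diff-based guard that rejects any model edit touching lines outside the requested region. A minimal sketch; `allowed_lines` stands in for however a given workflow defines the intended edit span:

```python
import difflib

def verify_edit(before: str, after: str, allowed_lines: range) -> list[str]:
    """Flag changes outside the region the user asked the model to touch.

    Returns human-readable violations; an empty list means the edit
    stayed in bounds and can be accepted.
    """
    violations = []
    matcher = difflib.SequenceMatcher(
        a=before.splitlines(), b=after.splitlines(), autojunk=False
    )
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        touched = range(i1, i2) if i2 > i1 else [i1]  # inserts touch one point
        if any(line not in allowed_lines for line in touched):
            violations.append(
                f"{op} at source lines {i1}-{i2} is outside the requested region"
            )
    return violations

before = "alpha\nbeta\ngamma\ndelta\n"
after = "alpha\nbeta (edited)\ngamma\ndelt\n"  # last line silently mangled
print(verify_edit(before, after, allowed_lines=range(1, 2)))
```

Gating acceptance of a model's output on an empty violation list would convert the silent corruption the study describes into a loud, reviewable failure.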

