Research Reveals LLMs Corrupt Documents During Delegated Work — Major Models Fail at Reliability
Key Takeaways
- Frontier LLMs (including Claude 4.6 Opus, GPT 5.4, Gemini 3.1 Pro) corrupt ~25% of document content in long delegated workflows
- Document degradation is silent and severe: errors compound over time without explicit warnings to users
- Agentic tool use does not mitigate corruption, and larger documents, longer interactions, and distractor files exacerbate the problem
Summary
A new arXiv research paper titled "LLMs Corrupt Your Documents When You Delegate" challenges the readiness of current AI systems for real-world delegation workflows. The study, which introduces the DELEGATE-52 benchmark, evaluates 19 LLMs across 52 professional domains (coding, crystallography, music notation, and more) in long, complex document-editing workflows. The findings are sobering: frontier models including GPT 5.4 (OpenAI), Gemini 3.1 Pro (Google), and Claude 4.6 Opus (Anthropic) corrupt an average of 25% of document content by the end of extended workflows, with other models performing even worse.
The research reveals that document degradation is not isolated to weaker models—even state-of-the-art frontier systems silently introduce sparse but severe errors that compound over time. Additional experiments show that agentic tool use does not improve performance, and that degradation worsens with document size, interaction length, and the presence of distractor files. The authors conclude that current LLMs are unreliable delegates, raising critical questions about their trustworthiness for knowledge work automation and the emerging "vibe coding" paradigm.
In short, the paper argues that current LLMs lack the reliability required for trust-critical delegation tasks across professional domains such as coding, legal, and scientific work.
Editorial Opinion
This research is a wake-up call for the AI industry. As enterprises and developers rush to delegate real work to LLMs, the DELEGATE-52 findings expose a critical gap between capability and reliability. The 25% document corruption rate in frontier models should spark urgent focus on robustness, verification, and user safeguards—not just raw performance metrics. Until LLMs can be trusted to edit documents without silent corruption, delegation will remain a risky proposition for knowledge work.