Microsoft Research Reveals LLMs Corrupt an Average of 25% of Document Content in Long Delegated Workflows
Key Takeaways
- ▸Frontier LLMs (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long delegated workflows; other models degrade documents more severely
- ▸LLMs introduce sparse but severe errors that compound silently over time, a critical reliability failure for delegated document editing
- ▸Agentic tool use does not mitigate the degradation, and corruption worsens with document size, interaction length, and the presence of distractor files
Summary
Microsoft Research has published DELEGATE-52, a benchmark study that evaluates the reliability of Large Language Models in delegated work, a nascent interaction paradigm in which users hand document editing tasks to AI systems. The study tested 19 LLMs across 52 professional domains, including coding, crystallography, and music notation, simulating realistic long-form workflows that require in-depth document edits.
The findings are sobering: even frontier models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content by the end of long workflows, while other models perform significantly worse. The research reveals that LLMs introduce sparse but severe errors that silently compound throughout interactions, fundamentally undermining their reliability as delegates. Additional analysis shows that agentic tool use fails to improve performance, and that document degradation is exacerbated by factors such as document size, interaction length, and the presence of distractor files.
These findings challenge the current industry optimism around agentic AI workflows and raise critical questions about the practical deployment of LLM-based autonomous systems in knowledge work environments where document integrity is essential.
- ▸The DELEGATE-52 benchmark evaluates 19 LLMs across 52 professional domains, providing the first systematic assessment of LLM reliability in delegation scenarios
Editorial Opinion
This research is a crucial reality check for the industry's enthusiasm around agentic AI workflows. While LLM-powered delegation is widely discussed as the next frontier of AI interaction, Microsoft's DELEGATE-52 findings expose a fundamental limitation: today's models are simply not reliable enough to be trusted with unsupervised document modification. The silent nature of these errors—subtle corruptions that go undetected until documents are reviewed—makes this particularly problematic. Organizations piloting agentic systems must implement robust validation and review mechanisms until models demonstrate substantially higher fidelity.
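As one illustration of what such a validation mechanism might look like, here is a minimal Python sketch that measures how much of an original document no longer survives after a delegated edit and flags the result for human review. It is not taken from the DELEGATE-52 study: the function names, the line-level comparison, and the 10% threshold are all assumptions made for this example, and a real deployment would need a check suited to its document format.

```python
import difflib


def lost_fraction(original: str, edited: str) -> float:
    """Fraction of original lines that no longer appear, in order, in the edited text.

    Hypothetical helper for illustration; compares documents line by line.
    """
    original_lines = original.splitlines()
    edited_lines = edited.splitlines()
    if not original_lines:
        return 0.0
    # autojunk=False keeps the matcher from discarding frequently repeated lines.
    matcher = difflib.SequenceMatcher(a=original_lines, b=edited_lines, autojunk=False)
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(original_lines)


def needs_human_review(original: str, edited: str, max_lost: float = 0.10) -> bool:
    """Flag an edit when more of the original changed than the task should touch.

    The 0.10 default is an arbitrary, assumed threshold, not a value from the study.
    """
    return lost_fraction(original, edited) > max_lost


if __name__ == "__main__":
    before = "title\nsection one\nsection two\nsection three"
    after = "title\nsection one\nsection 2 (rewritten)\nsection three"
    print(lost_fraction(before, after))
    print(needs_human_review(before, after))
```

A check like this catches only how much changed, not whether the change was correct, so it complements human review rather than replacing it. It is most useful for tasks expected to touch a small part of a document, where a large lost fraction is itself a warning sign.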

