Microsoft Research Finds Frontier LLMs Corrupt Documents During Long Delegated Workflows
Key Takeaways
- Microsoft's DELEGATE-52 benchmark reveals that even frontier LLMs corrupt approximately 25% of document content during long delegated workflows across 52 professional domains
- The study evaluated 19 LLMs, including Claude 4.6 Opus, Gemini 3.1 Pro, and GPT 5.4, with all models showing significant document degradation despite their advanced capabilities
- Agentic tool use does not improve reliability, and document corruption worsens with file size, interaction length, and the presence of distractor content, suggesting systematic issues rather than implementation problems
Summary
Microsoft researchers have published a new study introducing DELEGATE-52, a comprehensive benchmark that evaluates how reliably Large Language Models can perform autonomous document editing across 52 professional domains. The research tested 19 different LLMs, including frontier models such as Claude 4.6 Opus, Gemini 3.1 Pro, and GPT 5.4, revealing a critical limitation: even the most advanced models corrupt approximately 25% of document content by the end of long interaction workflows.
The benchmark simulates realistic delegated workflows spanning diverse professional domains, including coding, crystallography, and music notation. The researchers found that current LLMs silently introduce sparse but severe errors that compound over extended interactions. Notably, agentic tool use did not improve performance, and document degradation worsened with larger files, longer workflows, and the presence of distractor content.
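The paper's scoring code is not reproduced here, but the reported corruption rate can be pictured as the fraction of a reference document that no longer survives in the model's edited output. A minimal, illustrative sketch of such a metric, using a line-level diff (the function name and metric definition are assumptions, not the study's actual methodology):

```python
import difflib

def corruption_rate(reference: str, edited: str) -> float:
    """Fraction of reference lines lost or altered in the edited
    document, measured with a line-level diff (illustrative only)."""
    ref_lines = reference.splitlines()
    if not ref_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(None, ref_lines, edited.splitlines())
    # Count lines that survive unchanged in the edited version.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(ref_lines)

original = "alpha\nbravo\ncharlie\ndelta\n"
damaged = "alpha\nbravo\nCHARLIE??\ndelta\n"
print(corruption_rate(original, damaged))  # 0.25 (1 of 4 lines altered)
```

Under a metric like this, a score of 0.25 corresponds to the headline finding: roughly a quarter of the original content fails to survive a long delegated editing session.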
These findings have significant implications for organizations considering AI-powered automation of knowledge work. The study demonstrates that current LLMs cannot be trusted for autonomous document editing without human oversight, as silent corruption could introduce subtle but damaging errors into critical professional documents. The research effectively challenges the readiness of current AI systems for true delegated workflows where humans depend on LLMs to faithfully execute complex tasks.
The bottom line: current LLMs cannot be reliably trusted for autonomous document editing and cannot serve as faithful delegates in knowledge work without human oversight and validation.
Editorial Opinion
This research exposes a critical gap between frontier LLM capabilities on controlled benchmarks and their actual reliability in production workflows. The 25% document corruption rate among state-of-the-art models is alarming and should significantly temper enthusiasm for AI-driven knowledge work automation. The fact that agentic tool use provides no improvement suggests this is a fundamental limitation of current model architectures rather than a solvable engineering problem. Organizations considering delegating document editing to AI systems must treat this as a wake-up call to invest heavily in human oversight and validation mechanisms before adopting these systems at scale.