Microsoft Research Reveals LLMs Corrupt an Average of 25% of Document Content in Long Delegated Workflows

Key Takeaways

▸Frontier LLMs (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long delegated workflows; other models perform significantly worse
▸LLMs introduce sparse but severe errors that silently compound over time, a critical reliability failure for delegated document editing
▸Agentic tool use does not improve performance; degradation worsens with document size, interaction length, and the presence of distractor files

Source:

Hacker News: https://www.microsoft.com/en-us/research/publication/llms-corrupt-your-documents-when-you-delegate/

Summary

Microsoft Research has published a benchmark study, DELEGATE-52, that evaluates the reliability of large language models in delegated work, a nascent interaction paradigm in which users hand document editing tasks to AI systems. The study tested 19 LLMs across 52 professional domains, including coding, crystallography, and music notation, simulating realistic long workflows that require in-depth document edits.

The findings are sobering: even frontier models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content by the end of long workflows, while other models perform significantly worse. The research reveals that LLMs introduce sparse but severe errors that silently compound throughout interactions, fundamentally undermining their reliability as delegates. Additional analysis shows that agentic tool use fails to improve performance, and that document degradation is exacerbated by factors such as document size, interaction length, and the presence of distractor files.
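To get a feel for how small per-step losses could add up to the reported 25% figure, here is a rough back-of-the-envelope sketch. It assumes a uniform, independent loss at every step and a hypothetical workflow length of 20 steps; neither assumption comes from the study, which describes the errors as sparse but severe rather than evenly spread. The function names are illustrative only.

```python
def per_step_retention(final_retention: float, steps: int) -> float:
    """Per-step fraction of content preserved that compounds to
    `final_retention` after `steps` edits (uniform-decay assumption)."""
    if not 0.0 < final_retention <= 1.0:
        raise ValueError("final_retention must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return final_retention ** (1.0 / steps)


def retention_after(per_step: float, steps: int) -> float:
    """Fraction of content still intact after `steps` edits."""
    return per_step ** steps


if __name__ == "__main__":
    STEPS = 20  # hypothetical workflow length, not taken from the study
    r = per_step_retention(0.75, STEPS)  # 25% corrupted => 75% retained
    print(f"per-step retention: {r:.4f}")
    print(f"per-step loss:      {1 - r:.2%}")
    print(f"retained after {STEPS} steps: {retention_after(r, STEPS):.2%}")
```

Under these assumptions, a loss of roughly 1.4% per step is enough to reach 25% corruption by the end of the workflow, which illustrates why errors that look negligible in a single edit can matter over a long delegation.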

These findings challenge the current industry optimism around agentic AI workflows and raise critical questions about the practical deployment of LLM-based autonomous systems in knowledge work environments where document integrity is essential.

The DELEGATE-52 benchmark evaluates 19 LLMs across 52 professional domains, providing the first systematic assessment of LLM reliability in delegation scenarios

Editorial Opinion

This research is a crucial reality check for the industry's enthusiasm around agentic AI workflows. While LLM-powered delegation is widely discussed as the next frontier of AI interaction, Microsoft's DELEGATE-52 findings expose a fundamental limitation: today's models are not reliable enough to be trusted with unsupervised document modification. The errors are silent, subtle corruptions that go undetected until documents are reviewed, which makes unsupervised delegation particularly risky. Organizations piloting agentic systems must implement robust validation and review mechanisms until models demonstrate substantially higher fidelity.
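One simple form such a validation mechanism could take is a diff-based gate that sends an edit to a human reviewer when too much of the original document was touched. The sketch below uses Python's standard `difflib`; the function names and the 10% threshold are hypothetical choices for illustration, not something the study prescribes, and a line-level diff like this cannot judge whether a change is semantically correct.

```python
import difflib


def changed_fraction(original: str, edited: str) -> float:
    """Fraction of the original's lines that were replaced or deleted."""
    a = original.splitlines()
    b = edited.splitlines()
    if not a:
        return 0.0
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    touched = 0
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "insert" adds new lines without altering original ones,
        # so only "replace" and "delete" count as touching the original.
        if tag in ("replace", "delete"):
            touched += i2 - i1
    return touched / len(a)


def needs_review(original: str, edited: str, max_changed: float = 0.10) -> bool:
    """Flag the edit for human review if it touches more than
    `max_changed` of the original lines (threshold is an assumption)."""
    return changed_fraction(original, edited) > max_changed


if __name__ == "__main__":
    before = "title\nalpha\nbeta\ngamma"
    after = "title\nALPHA\nbeta\ngamma"
    print(changed_fraction(before, after))
    print(needs_review(before, after))
```

A gate like this is coarse by design: it catches edits that stray far beyond the requested change, while small but severe corruptions would still need domain-specific checks or manual review.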
