BotBeat

Microsoft | Research | 2026-05-26

Microsoft Research Reveals LLMs Corrupt an Average of 25% of Document Content in Long Delegated Workflows

Key Takeaways

  • Frontier LLMs (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long delegated workflows; other models fare significantly worse
  • LLMs introduce sparse but severe errors that silently compound over the course of an interaction, a critical reliability failure for delegated document editing
  • Agentic tool use does not improve performance; degradation worsens with document size, interaction length, and the presence of distractor files
Source: Hacker News, https://www.microsoft.com/en-us/research/publication/llms-corrupt-your-documents-when-you-delegate/

Summary

Microsoft Research has published DELEGATE-52, a benchmark study that evaluates the reliability of large language models (LLMs) in delegated work, a nascent interaction paradigm in which users hand document editing tasks to AI systems. The study tested 19 LLMs across 52 professional domains, including coding, crystallography, and music notation, simulating realistic long workflows that require in-depth document edits.

The findings are sobering: even frontier models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content by the end of long workflows, while other models perform significantly worse. The research reveals that LLMs introduce sparse but severe errors that silently compound throughout interactions, fundamentally undermining their reliability as delegates. Additional analysis shows that agentic tool use fails to improve performance, and that document degradation is exacerbated by factors such as document size, interaction length, and the presence of distractor files.
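To see how small errors can compound into a 25% loss, here is a rough back-of-the-envelope sketch in Python. It assumes each editing step independently preserves the same fraction of the document, which is a simplification: the study describes the errors as sparse but severe, not uniform. The 20-step workflow length and both function names are hypothetical and do not come from the study.

```python
def per_step_retention(final_retention: float, steps: int) -> float:
    """Return the per-step retained fraction r such that r ** steps == final_retention.

    Assumes every step preserves the same fraction of content (a simplification).
    """
    if not 0.0 < final_retention <= 1.0:
        raise ValueError("final_retention must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return final_retention ** (1.0 / steps)


def retained_after(per_step: float, steps: int) -> float:
    """Fraction of content still intact after `steps` edits at a fixed per-step rate."""
    return per_step ** steps


if __name__ == "__main__":
    # 0.75 retained corresponds to the reported 25% average corruption;
    # 20 steps is a hypothetical workflow length chosen only for illustration.
    r = per_step_retention(0.75, 20)
    print(f"per-step loss: {1 - r:.2%}")
    print(f"retained after 20 steps: {retained_after(r, 20):.2%}")
```

Under these assumptions, a loss of roughly 1.4% per step is enough to reach 25% corruption after 20 steps, which helps explain why such errors can go unnoticed at any single step.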

These findings challenge the current industry optimism around agentic AI workflows and raise critical questions about the practical deployment of LLM-based autonomous systems in knowledge work environments where document integrity is essential.

  • The DELEGATE-52 benchmark evaluates 19 LLMs across 52 professional domains, providing the first systematic assessment of LLM reliability in delegation scenarios

Editorial Opinion

This research is a crucial reality check for the industry's enthusiasm around agentic AI workflows. While LLM-powered delegation is widely discussed as the next frontier of AI interaction, Microsoft's DELEGATE-52 findings expose a fundamental limitation: today's models are not reliable enough to be trusted with unsupervised document modification. The silent nature of these errors (subtle corruptions that go undetected until documents are reviewed) makes this particularly problematic. Organizations piloting agentic systems must implement robust validation and review mechanisms until models demonstrate substantially higher fidelity.
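As one illustration of what a lightweight validation step might look like, the sketch below compares a document before and after a delegated edit and flags it for human review when too many original lines were lost or altered. It uses Python's standard `difflib` module. The function names and the 10% threshold are hypothetical choices for this example, not something the study prescribes.

```python
import difflib


def changed_fraction(before: str, after: str) -> float:
    """Fraction of the original lines that are not preserved in the edited text."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    if not old_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    # Sum the sizes of all runs of lines that appear unchanged in both versions.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(old_lines)


def needs_review(before: str, after: str, max_changed: float = 0.10) -> bool:
    """Flag an edit when more than `max_changed` of the original lines changed.

    The 10% default is an arbitrary, hypothetical threshold.
    """
    return changed_fraction(before, after) > max_changed


if __name__ == "__main__":
    original = "title\nintro\nmethod\nresults"
    edited = "title\nintro\nmethod (revised)\nresults"
    print(changed_fraction(original, edited))
    print(needs_review(original, edited))
```

A line-level diff like this only measures how much changed, not whether the change was correct, so it is best read as a trigger for human review and not a substitute for it.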

Large Language Models (LLMs) | AI Agents | Machine Learning | AI Safety & Alignment
