Microsoft Study Reveals AI Models Fail at Long-Running Tasks, Losing 25% of Content
Key Takeaways
- Frontier AI models (Claude 4.6, GPT 5.4, Gemini 3.1 Pro) lose an average of 25% of document content over 20 multistep interactions
- Only Python programming was deemed ready for delegated workflows; more than 80% of model/domain combinations showed catastrophic corruption (accuracy of 80% or less)
- Stronger models don't avoid errors better; they delay critical failures to later rounds, producing sudden data loss rather than gradual degradation
Summary
Microsoft researchers have published a preprint paper titled "LLMs Corrupt Your Documents When You Delegate," presenting troubling findings about the ability of large language models to handle multistep workflows. The study, conducted by researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville, tests frontier models including Anthropic's Claude 4.6 Opus, OpenAI's GPT 5.4, and Google's Gemini 3.1 Pro on a new benchmark called DELEGATE-52, which simulates multistep workflows across 52 professional domains such as accounting, crystallography, and music notation.
The results are alarming: frontier models lose an average of 25 percent of document content over 20 delegated interactions, with degradation rates averaging 50 percent across all models tested. Only one domain—Python programming—met the researchers' "ready" threshold of 98 percent or higher accuracy after 20 interactions. The study found "catastrophic corruption" (80 percent or less accuracy) in more than 80 percent of model/domain combinations, with the best-performing model, Google Gemini 3.1 Pro, qualifying as ready for only 11 of 52 domains tested.
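A rough back-of-the-envelope calculation shows why per-interaction reliability matters so much over 20 rounds. The sketch below is illustrative only: it assumes degradation compounds multiplicatively per interaction, which is an assumption of this example rather than the paper's stated methodology, and the function and figures are not taken from the study.

```python
# Illustrative sketch only: assumes content retention compounds multiplicatively
# per interaction. The paper may define its accuracy metric differently.

def retention_after(per_step_retention: float, steps: int = 20) -> float:
    """Fraction of original content left after `steps` delegated interactions."""
    return per_step_retention ** steps

# Losing ~25% over 20 interactions corresponds to ~98.6% retention per step
per_step = 0.75 ** (1 / 20)
print(f"per-interaction retention: {per_step:.3f}")  # ~0.986

# A seemingly small drop to 97% per step already breaches the 80% "catastrophic" line
print(f"97% per step, 20 rounds:   {retention_after(0.97):.2f}")   # ~0.54
print(f"99.9% per step, 20 rounds: {retention_after(0.999):.3f}")  # ~0.980
```

Under that compounding assumption, the 98 percent "ready" bar after 20 interactions corresponds to roughly 99.9 percent retention on every single step, which illustrates how narrow the margin for per-interaction error is.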
These findings directly challenge marketing claims from major AI providers. Anthropic promotes Claude for autonomous task completion, while Microsoft touts Microsoft 365 Copilot's ability to handle complex multistep research tasks. The research suggests that frontier models are simply not reliable enough for production use in most professional domains: stronger models tend to delay critical failures rather than prevent them, resulting in the sudden loss of 10-30 percent of data in a single interaction. The bottom line is that current LLMs are not ready for autonomous workflows in the vast majority of professional domains, contradicting vendor marketing claims.
Editorial Opinion
This research is a crucial reality check for an AI industry that has been overselling autonomous agent capabilities. The finding that even frontier models catastrophically fail in more than 80% of tested model/domain combinations should give any organization considering delegating critical workflows to AI agents serious pause. However, the research also suggests the problem may not be fundamental but rather one of current model reliability: further work on techniques to prevent document corruption could make long-running task delegation viable in the near term.
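As a purely hypothetical illustration of what such a corruption-prevention technique could look like (it is not described in the paper, and the function names and threshold below are invented for the example), a delegation harness might measure how much of the prior document survives each model edit and reject edits that silently drop content.

```python
# Hypothetical guardrail sketch -- not from the paper. One plausible mitigation:
# after each delegated edit, measure how much of the prior document survives and
# reject edits that silently drop too much content.
from difflib import SequenceMatcher

def retention_ratio(before: str, after: str) -> float:
    """Fraction of the prior document's characters still matched in the new version."""
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(before), 1)

def apply_edit_with_guardrail(document: str, edited: str, min_retention: float = 0.9) -> str:
    """Accept the model's edit only if enough of the original content survives."""
    if retention_ratio(document, edited) < min_retention:
        # In a real workflow this would trigger a retry or human review instead
        raise ValueError("Edit rejected: too much prior content was lost")
    return edited
```

A diff-based retention check of this kind would not catch subtle factual corruption, but it would at least convert the silent 10-30 percent losses the study describes into explicit failures that a human can review.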


