Microsoft Study Reveals AI Models Fail at Long-Running Tasks, Losing 25% of Content
Key Takeaways
- Frontier AI models (Claude 4.6, GPT 5.4, Gemini 3.1 Pro) lose an average of 25% of document content over 20 multistep interactions
- Only Python programming was deemed ready for delegated workflows; more than 80% of model/domain combinations showed catastrophic corruption (accuracy of 80% or less)
- Stronger models don't avoid errors better; they delay critical failures to later rounds, producing sudden data loss rather than gradual degradation
Summary
Microsoft researchers have published a preprint paper titled "LLMs Corrupt Your Documents When You Delegate," presenting troubling findings about the ability of large language models to handle multistep workflows. The study, conducted by researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville, tests frontier models including Anthropic's Claude 4.6 Opus, OpenAI's GPT 5.4, and Google's Gemini 3.1 Pro on a new benchmark called DELEGATE-52, which simulates multistep workflows across 52 professional domains such as accounting, crystallography, and music notation.
The results are alarming: frontier models lose an average of 25 percent of document content over 20 delegated interactions, with degradation rates averaging 50 percent across all models tested. Only one domain—Python programming—met the researchers' "ready" threshold of 98 percent or higher accuracy after 20 interactions. The study found "catastrophic corruption" (80 percent or less accuracy) in more than 80 percent of model/domain combinations, with the best-performing model, Google Gemini 3.1 Pro, qualifying as ready for only 11 of 52 domains tested.
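A rough back-of-the-envelope calculation shows why per-interaction reliability matters so much over 20 rounds. The sketch below is illustrative only: it assumes degradation compounds multiplicatively per interaction, which is an assumption of this example rather than the paper's stated methodology, and the function and figures are not taken from the study.

```python
# Illustrative sketch only: assumes content retention compounds multiplicatively
# per interaction. The paper may define its accuracy metric differently.

def retention_after(per_step_retention: float, steps: int = 20) -> float:
    """Fraction of original content left after `steps` delegated interactions."""
    return per_step_retention ** steps

# Losing ~25% over 20 interactions corresponds to ~98.6% retention per step
per_step = 0.75 ** (1 / 20)
print(f"per-interaction retention: {per_step:.3f}")  # ~0.986

# A seemingly small drop to 97% per step already breaches the 80% "catastrophic" line
print(f"97% per step, 20 rounds:   {retention_after(0.97):.2f}")   # ~0.54
print(f"99.9% per step, 20 rounds: {retention_after(0.999):.3f}")  # ~0.980
```

Under that compounding assumption, the 98 percent "ready" bar after 20 interactions corresponds to roughly 99.9 percent retention on every single step, which illustrates how narrow the margin for per-interaction error is.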
These findings directly challenge marketing claims from major AI providers. Anthropic promotes Claude for autonomous task completion, while Microsoft touts Microsoft 365 Copilot's ability to handle complex multistep research tasks. The research suggests that frontier models are simply not reliable enough for production use in most professional domains: stronger models tend to delay critical failures rather than prevent them, resulting in the sudden loss of 10-30 percent of data in a single interaction. The bottom line is that current LLMs are not ready for autonomous workflows in the vast majority of professional domains, contradicting vendor marketing claims.
Editorial Opinion
This research is a crucial reality check for an AI industry that has been overselling autonomous agent capabilities. The finding that even frontier models catastrophically fail in more than 80% of tested model/domain combinations should give any organization considering delegating critical workflows to AI agents serious pause. However, the research also suggests the problem may not be fundamental but rather one of current model reliability: further work on techniques to prevent document corruption could make long-running task delegation viable in the near term.
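As a purely hypothetical illustration of what such a corruption-prevention technique could look like (it is not described in the paper, and the function names and threshold below are invented for the example), a delegation harness might measure how much of the prior document survives each model edit and reject edits that silently drop content.

```python
# Hypothetical guardrail sketch -- not from the paper. One plausible mitigation:
# after each delegated edit, measure how much of the prior document survives and
# reject edits that silently drop too much content.
from difflib import SequenceMatcher

def retention_ratio(before: str, after: str) -> float:
    """Fraction of the prior document's characters still matched in the new version."""
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(before), 1)

def apply_edit_with_guardrail(document: str, edited: str, min_retention: float = 0.9) -> str:
    """Accept the model's edit only if enough of the original content survives."""
    if retention_ratio(document, edited) < min_retention:
        # In a real workflow this would trigger a retry or human review instead
        raise ValueError("Edit rejected: too much prior content was lost")
    return edited
```

A diff-based retention check of this kind would not catch subtle factual corruption, but it would at least convert the silent 10-30 percent losses the study describes into explicit failures that a human can review.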


