BotBeat

RESEARCH | Microsoft | 2026-05-12

Microsoft Study Reveals AI Models Fail at Long-Running Tasks, Losing 25% of Content

Key Takeaways

  • Frontier AI models (Claude 4.6, GPT 5.4, Gemini 3.1 Pro) lose an average of 25% of document content over 20 multistep interactions
  • Only Python programming was deemed ready for delegated workflows; more than 80% of model/domain combinations showed catastrophic corruption (accuracy of 80% or lower)
  • Stronger models do not avoid errors; they delay critical failures to later rounds, so data is lost suddenly rather than gradually
Source: Hacker News (https://www.theregister.com/ai-ml/2026/05/11/microsoft-researchers-find-ai-models-and-agents-cant-handle-long-running-tasks/5238263)

Summary

Microsoft researchers have published a preprint paper titled "LLMs Corrupt Your Documents When You Delegate," presenting troubling findings about the ability of large language models to handle multistep workflows. The study, conducted by researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville, tests frontier models including Anthropic's Claude 4.6 Opus, OpenAI's GPT 5.4, and Google's Gemini 3.1 Pro on a new benchmark called DELEGATE-52, which simulates multistep workflows across 52 professional domains such as accounting, crystallography, and music notation.

The results are alarming: frontier models lose an average of 25 percent of document content over 20 delegated interactions, and degradation averages 50 percent across all models tested. Only one domain, Python programming, was judged ready overall under the researchers' "ready" threshold of 98 percent or higher accuracy after 20 interactions. The study found "catastrophic corruption" (accuracy of 80 percent or lower) in more than 80 percent of model/domain combinations. Taken individually, even the best-performing model, Google's Gemini 3.1 Pro, qualified as ready in only 11 of the 52 domains tested.
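The two thresholds reported above can be made concrete with a small sketch. It is an illustration only, not the researchers' evaluation code: the function names, the "degraded" label for the middle band, and the sample scores are hypothetical. Only the 98 and 80 percent cutoffs come from the study as reported.

```python
from collections import Counter

# Cutoffs as reported in the article (percent accuracy after 20 interactions).
READY_THRESHOLD = 98.0
CATASTROPHIC_THRESHOLD = 80.0


def classify(score: float) -> str:
    """Map a final accuracy score (0-100) to a readiness label."""
    if score >= READY_THRESHOLD:
        return "ready"
    if score <= CATASTROPHIC_THRESHOLD:
        return "catastrophic"
    # Hypothetical name for the band between the two reported cutoffs.
    return "degraded"


def summarize(scores: dict[tuple[str, str], float]) -> dict[str, float]:
    """Return the share of model/domain combinations in each band."""
    labels = ("ready", "degraded", "catastrophic")
    if not scores:
        return {label: 0.0 for label in labels}
    counts = Counter(classify(score) for score in scores.values())
    return {label: counts[label] / len(scores) for label in labels}


# Made-up scores keyed by (model, domain); not data from the study.
sample_scores = {
    ("model_a", "python"): 99.1,
    ("model_a", "accounting"): 74.0,
    ("model_b", "python"): 91.0,
    ("model_b", "music_notation"): 61.5,
}

if __name__ == "__main__":
    print(summarize(sample_scores))
```

Under this reading, a combination counts as catastrophic at exactly 80 percent and as ready at exactly 98 percent, which matches the "or lower" and "or higher" wording in the summary.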

These findings directly challenge marketing claims from major AI providers. Anthropic promotes Claude for autonomous task completion, while Microsoft touts Microsoft 365 Copilot's ability to handle complex multistep research tasks. The research suggests that frontier models are simply not reliable enough for production use in most professional domains, with stronger models tending to delay critical failures rather than prevent them, resulting in sudden loss of 10-30 percent of data in single interactions.
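To see why the shape of the failure matters, the sketch below compares two hypothetical trajectories that both end with 75 percent of the content intact after 20 interactions. One loses a small, constant fraction each round; the other stays intact and then drops in a single interaction. Both curves are assumptions made for illustration: the failure round (18) is invented, and the study's actual per-round data is not reproduced here.

```python
def gradual_trajectory(final_retention: float, rounds: int) -> list[float]:
    """Retention per round if the same fraction is lost every round."""
    per_round = final_retention ** (1 / rounds)
    return [per_round ** k for k in range(1, rounds + 1)]


def sudden_trajectory(
    final_retention: float, rounds: int, failure_round: int
) -> list[float]:
    """Retention per round if all loss happens in one interaction."""
    return [
        1.0 if k < failure_round else final_retention
        for k in range(1, rounds + 1)
    ]


if __name__ == "__main__":
    # Illustrative parameters: 75% retained after 20 rounds,
    # with the hypothetical sudden failure placed at round 18.
    gradual = gradual_trajectory(0.75, 20)
    sudden = sudden_trajectory(0.75, 20, failure_round=18)
    for k, (g, s) in enumerate(zip(gradual, sudden), start=1):
        print(f"round {k:2d}  gradual={g:.3f}  sudden={s:.3f}")
```

In the gradual case the loss is roughly 1.4 percent per round, so a spot check after a few rounds already shows drift. In the sudden case every check before the failure round looks clean, which is why delayed failures are harder to catch with early review.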

  • Current LLMs are not ready for autonomous workflows in the vast majority of professional domains, contradicting vendor marketing claims

Editorial Opinion

This research is a crucial reality check for the AI industry, which has been overselling autonomous agent capabilities. The finding that even frontier models suffer catastrophic corruption in more than 80% of model/domain combinations should give serious pause to any organization considering delegating critical workflows to AI agents. However, the research also suggests the problem may not be fundamental but rather an issue of model reliability: further work on techniques to prevent document corruption could make long-running task delegation viable in the near term.

