METR Metrics Show AI Task-Completion Ability Doubling Every 7 Months; One-Month Horizon Expected by 2029
Key Takeaways
- METR introduces the '50%-task-completion time horizon' metric, which measures the length of software tasks AI models can complete, a more interpretable gauge of progress than raw benchmark scores
- AI task-completion ability has been doubling every 7 months; o3 now reaches 110 minutes vs. GPT-2's 2 seconds, with a one-month horizon projected for mid-2029 (central estimate)
- Critical caveat: the time horizon at an 80% success rate is 4-6x shorter than at 50%, and performance collapses in messy, real-world environments that lack clear feedback mechanisms or documentation
Summary
METR (Model Evaluation & Threat Research) has introduced a new benchmark metric called the "50%-task-completion time horizon," which measures the length of software engineering tasks that frontier AI models can complete at a 50% success rate. Evaluating 12 frontier models on 170 tasks, benchmarked against over 800 human developer baselines, the research reveals a striking trend: this time horizon has been doubling every 7 months since 2019, with the o3 model now reaching a 110-minute horizon compared to GPT-2's 2 seconds. If this exponential trend continues, AI systems could reach a one-month time horizon (equivalent to 167 working hours of skilled developer effort) between mid-2028 and mid-2031, with a central estimate of mid-2029.
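As a rough back-of-envelope check on that projection, the sketch below extrapolates directly from the figures quoted above (o3's 110-minute horizon, a 7-month doubling period, and 167 working hours per month). It is a simplification for illustration, not METR's actual forecasting model.

```python
import math

# Naive extrapolation of the reported trend, using only the numbers
# quoted in this summary (not METR's full forecasting methodology).
doubling_period_months = 7          # reported doubling time of the 50% horizon
current_horizon_minutes = 110       # o3's 50% time horizon, early 2025
target_horizon_minutes = 167 * 60   # "one month" = 167 working hours

# Number of doublings needed to go from 110 minutes to one month of work,
# then the calendar time those doublings would take.
doublings = math.log2(target_horizon_minutes / current_horizon_minutes)
months_out = doublings * doubling_period_months

print(f"doublings needed: {doublings:.1f}")         # ~6.5
print(f"months from early 2025: {months_out:.0f}")  # ~46, i.e. late 2028
```

This naive arithmetic lands near the lower end of METR's mid-2028 to mid-2031 window; the mid-2029 central estimate reflects uncertainty in the trend that the calculation above ignores.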
However, METR's analysis reveals critical caveats that temper these headline numbers. The 50% success rate masks a significant reliability gap: at an 80% success rate, the time horizon is 4-6x shorter, meaning that by the time models can complete month-long tasks half the time, the tasks they can complete reliably will be closer to a week long. Performance degrades sharply in messy environments that lack clear feedback mechanisms such as automated tests, and the metrics are calibrated against general-purpose developers, not domain experts. AI performance on real pull requests aligns more closely with that of low-context contractors (5-18x slower than expert maintainers) than with experienced developers, suggesting the effective horizon on complex legacy systems is far shorter than the headline figure.
The research also highlights that well-structured, well-tested codebases will see far greater gains from AI automation than legacy systems with poor documentation. This suggests an uneven near-term impact: greenfield projects and new feature development could be transformed by autonomous AI agents, while deep maintenance, debugging of subtle production issues, and evolution of tightly coupled legacy systems will remain primarily human work. The study's methodology is rigorous and reproducible, with validation across multiple benchmarks including SWE-bench, though the task distribution remains skewed toward isolated coding challenges rather than the full scope of real-world software engineering, such as architectural decision-making and live operations.
- AI agents perform at the level of low-context contractors (5-18x slower than expert maintainers), suggesting significant gaps on legacy systems and domain-specific tasks that require institutional knowledge
- Greenfield work and well-tested codebases will see rapid gains, while maintenance, debugging, and deeply coupled legacy systems will remain harder to automate for years to come
Editorial Opinion
METR's research makes a valuable contribution by centering AI progress on a practical, human-calibrated metric rather than abstract benchmark scores. The exponential trend is genuinely striking, but the reliability gap between 50% and 80% success rates deserves equal attention; an AI system that completes a month's worth of work only some of the time is useful primarily for greenfield projects, not for the maintenance and debugging that consume much of real software engineering. The near-term reality is likely a sharp productivity divide: teams building new systems from scratch will see transformative gains, while teams maintaining complex legacy codebases will realize much more modest improvements.