New Benchmark Reveals AI Agents Struggle to Automate Real-World Remote Work
Key Takeaways
- The Remote Labor Index (RLI) tests AI agents on real remote work projects worth over $140,000 and representing more than 6,000 hours of human labor, spanning industries from game development to data analysis
- State-of-the-art AI agents currently perform near the floor on these tasks, with very low automation rates and most projects falling short of professionally acceptable quality
- While current performance is low, the benchmark shows measurable progress across models, providing a framework to track the actual pace of AI-driven labor automation
Summary
The Center for AI Safety, in collaboration with Scale AI, has released the Remote Labor Index (RLI), a comprehensive benchmark designed to measure how well AI agents can automate actual remote work projects. The benchmark comprises real-world, economically valuable tasks spanning game development, product design, architecture, data analysis, and video animation—representing over 6,000 hours of work valued at more than $140,000. Unlike traditional AI benchmarks focused on knowledge and reasoning, RLI tests end-to-end agent performance on complete projects that human professionals have actually completed for clients.
The results reveal a stark reality: state-of-the-art AI agents perform near the floor on these tasks, with even the best-performing models achieving very low automation rates. The research demonstrates that contemporary AI systems fail to complete the vast majority of projects at a quality level that would be acceptable as commissioned work. Projects in the benchmark vary widely in complexity, with some costing over $10,000 and requiring more than 100 hours of human labor to complete.
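An "automation rate" of this kind can be measured in more than one way: as the share of projects completed acceptably, or weighted by each project's dollar value. The sketch below is purely illustrative; the project names, values, and both rate definitions are assumptions for demonstration, not the RLI's actual scoring method.

```python
# Hypothetical sketch of project-level automation rates. The RLI's
# real scoring methodology is defined in the paper, not here.
def automation_rates(results):
    """Return (simple, value-weighted) automation rates.

    `results` maps project id -> dict with 'accepted' (bool, did the
    agent's deliverable meet professional quality?) and 'value_usd'
    (float, what the client paid a human for the project).
    """
    n = len(results)
    accepted = [r for r in results.values() if r["accepted"]]
    # Simple rate: fraction of projects completed acceptably.
    simple = len(accepted) / n if n else 0.0
    # Value-weighted rate: fraction of total project value automated.
    total_value = sum(r["value_usd"] for r in results.values())
    automated_value = sum(r["value_usd"] for r in accepted)
    weighted = automated_value / total_value if total_value else 0.0
    return simple, weighted

# Invented example data, illustrating how a single cheap success
# barely moves a value-weighted rate when expensive projects fail.
projects = {
    "logo-design": {"accepted": False, "value_usd": 400.0},
    "data-cleanup": {"accepted": True, "value_usd": 250.0},
    "game-level": {"accepted": False, "value_usd": 10_000.0},
    "animation": {"accepted": False, "value_usd": 1_200.0},
}
simple, weighted = automation_rates(projects)
print(f"simple: {simple:.1%}, value-weighted: {weighted:.1%}")
# → simple: 25.0%, value-weighted: 2.1%
```

The gap between the two numbers in the toy example mirrors a dynamic the benchmark highlights: succeeding only on small, cheap tasks translates into very little of the economic value being automated.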
Despite the low absolute performance, the researchers note that AI models are showing steady improvement, and progress on these complex tasks is measurable. The benchmark is designed to provide a common basis for tracking the trajectory of AI automation capabilities over time. By grounding discussions of AI-driven labor automation in empirical evidence, RLI aims to help stakeholders—including workers, employers, and policymakers—better understand and prepare for the actual pace and scope of AI's impact on the labor market.
The research paper, authored by a large team including researchers from the Center for AI Safety and Scale AI, is now available along with code on GitHub. The project includes a live leaderboard tracking model performance, enabling ongoing assessment of how AI capabilities evolve on real-world work tasks rather than academic benchmarks.
Editorial Opinion
The Remote Labor Index represents a crucial reality check for the AI industry and society at large. While AI systems have achieved impressive results on academic benchmarks and narrow tasks, this research reveals a significant gap between benchmark performance and the ability to deliver real economic value through complete, professional-quality work. The sobering results—showing frontier AI agents struggling with tasks that human freelancers routinely complete—suggest that fears of imminent widespread job displacement may be premature, even as the benchmark's design enables us to track when that picture changes. This kind of grounded, empirical assessment is exactly what's needed to replace both hype and anxiety with actionable intelligence.