Anthropic's Haiku 4.5 with Skills Outperforms Opus 4.7 Without Skills in Comprehensive 880-Eval Benchmark
Key Takeaways
- Haiku 4.5 with skills (84.3%) outperformed the Opus 4.7 baseline (80.5%), showing that smaller models with the right augmentation can beat frontier models without it
- All 88 tested configurations across 9 models showed positive performance lifts when skills were loaded, with gains ranging from +11.3 to +23.1 percentage points
- Weaker models benefited most from skills: Haiku gained 23.1 points versus 14 for Opus 4.7, suggesting skills are the key to cost-effective AI deployment
Summary
A comprehensive evaluation of nine AI models across 880 test cases reveals that agent skills have become a decisive factor in model performance, potentially outweighing raw model capability. Anthropic's Haiku 4.5 with skills loaded achieved an 84.3% success rate, surpassing Opus 4.7's baseline 80.5% performance—demonstrating that smaller, more cost-effective models can compete with frontier models when augmented with the right skills. The benchmark tested configurations from Anthropic (Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku 4.5), OpenAI (three Codex variants), and Cursor's Composer-2, with every single configuration showing positive performance gains when skills were enabled. The research suggests that as agent skills become widespread across AI ecosystems in 2026, the strategic value of context development and skill optimization may eclipse the importance of selecting the largest or most expensive model.
The cost-performance math in AI systems is shifting from pure model selection to context optimization and skill development, a trend that will likely define competitive advantage in 2026.
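The lift figures above follow directly from subtracting each model's baseline score from its with-skills score. A minimal sketch of that arithmetic, using the scores reported in this article (the Haiku baseline and the Opus with-skills score are reconstructed from the stated lifts and should be read as derived, not independently reported):

```python
# Success rates (%) on the 880-eval suite, per the article.
# Values marked "derived" are reconstructed from the reported lifts.
scores = {
    "Haiku 4.5": {"baseline": 84.3 - 23.1, "with_skills": 84.3},   # baseline derived
    "Opus 4.7":  {"baseline": 80.5, "with_skills": 80.5 + 14.0},   # with_skills derived
}

def lift(model: str) -> float:
    """Percentage-point gain from loading skills."""
    s = scores[model]
    return round(s["with_skills"] - s["baseline"], 1)

for name in scores:
    print(f"{name}: +{lift(name)} pts with skills")
```

The same subtraction applied across all 88 configurations is what yields the reported +11.3 to +23.1 range.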
Editorial Opinion
This benchmark result reframes how engineering teams should approach AI model selection and deployment. Rather than pursuing the most powerful or expensive model as a default, organizations can achieve superior results by pairing mid-tier models like Haiku with well-crafted skills, dramatically improving both performance and cost efficiency. The finding that every single configuration improved with skills suggests we're witnessing a fundamental shift in AI development—from a model-centric paradigm to a context-centric one where the surrounding knowledge and capabilities matter as much as the base model.