CursorBench 3.1 Released: New Coding Benchmark Shows Fable 5 Leads in Code Understanding and Review Tasks
Key Takeaways
- Fable 5 achieves the highest benchmark scores (70%+), with Fable 5 High balancing performance (70.6%) and cost ($10.81/task)
- Significant cost variance across models: Composer 2.5 costs $0.55/task while Fable 5 Max reaches $18.02, a price difference of more than 30x
- New benchmark version expands scope to include codebase understanding, bugfinding, planning, and code review, all key real-world developer needs
Summary
Cursor has released CursorBench 3.1, an updated benchmark for evaluating AI coding assistant models across real-world development tasks. The new version introduces expanded problem sets focused on codebase understanding, bugfinding, planning, and code review. This marks a significant shift from the original benchmark's focus on edit and refactor tasks.
According to the benchmark results, Anthropic's Fable 5 model leads the rankings, with scores above 70%. The data reveals distinct performance-cost tradeoffs across the 25+ models tested, from OpenAI's GPT-5.5 to Gemini 3.5 Flash to open-source alternatives like Kimi. Notably, Fable 5 High achieves 70.6% at $10.81 per task, ahead of competitors on accuracy, although cheaper models such as Gemini 3.5 Flash (49.8% at $1.94 per task) return more benchmark points per dollar.
The benchmark methodology now includes improved grading criteria for edit tasks and measures three key dimensions: accuracy scores, token consumption, and cost per task. This provides developers and enterprises with concrete data to evaluate which models offer the best performance-to-cost ratio for their specific coding workflows.
- Smaller models like Gemini 3.5 Flash cost far less ($1.94/task) but lag on accuracy (49.8%), making them a clear low-cost alternative
- The benchmark incorporates published per-token pricing, making cost comparisons directly applicable to production deployments
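
As a rough illustration of the performance-to-cost comparison, the figures quoted in this article can be turned into a points-per-dollar number. This is a sketch only and not part of CursorBench: `points_per_dollar` and `cost_ratio` are hypothetical helpers, and dividing score by cost per task is just one assumed way to define cost efficiency. Only models for which the article gives both a score and a cost are included.

```python
# Figures quoted in the article: score in percent, cost in USD per task.
results = {
    "Fable 5 High": {"score": 70.6, "cost": 10.81},
    "Gemini 3.5 Flash": {"score": 49.8, "cost": 1.94},
}


def points_per_dollar(score: float, cost: float) -> float:
    """Benchmark points per dollar spent on one task (assumed metric)."""
    return score / cost


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per task."""
    return expensive / cheap


# Efficiency for each model with both figures available.
efficiency = {
    name: round(points_per_dollar(r["score"], r["cost"]), 2)
    for name, r in results.items()
}

# Cost spread between Fable 5 Max ($18.02) and Composer 2.5 ($0.55).
spread = round(cost_ratio(18.02, 0.55), 1)

for name, value in efficiency.items():
    print(f"{name}: {value} points per dollar")
print(f"Cost spread: {spread}x")
```

On these numbers, Gemini 3.5 Flash returns about 25.67 points per dollar against 6.53 for Fable 5 High, and Fable 5 Max costs about 32.8 times as much per task as Composer 2.5. A single ratio like this ignores whether the lower accuracy is acceptable for a given workflow, so it is best read alongside the raw scores.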
Editorial Opinion
CursorBench 3.1 provides the most comprehensive public evaluation of coding assistant models to date, moving beyond toy edit tasks to real developer workflows like bugfinding and codebase navigation. Fable 5's dominant performance is significant: it suggests Anthropic's latest model may finally offer the code understanding capabilities that early Cursor users have long demanded. However, the massive cost spread (from $0.55 to $18 per task) underscores an uncomfortable truth: no single model wins outright. Teams will need to choose based on their accuracy requirements, latency tolerance, and budget.



