BotBeat

RESEARCH | Anysphere (Cursor) | 2026-07-02

CursorBench 3.1 Released: New Coding Benchmark Shows Fable 5 Leads in Code Understanding and Review Tasks

Key Takeaways

  • Fable 5 achieves the highest benchmark scores (70%+); Fable 5 High scores 70.6% at $10.81/task
  • Cost varies widely across models: Composer 2.5 costs $0.55/task while Fable 5 Max reaches $18.02/task, a spread of more than 30x
  • The new benchmark version expands its scope to codebase understanding, bugfinding, planning, and code review, all key real-world developer needs
Source: Hacker News, https://cursor.com/evals

Summary

Cursor has released CursorBench 3.1, an updated benchmark for evaluating AI coding assistant models across real-world development tasks. The new version introduces expanded problem sets focused on codebase understanding, bugfinding, planning, and code review—representing a significant evolution from the original benchmark's focus on edit and refactor tasks.

According to the benchmark results, Anthropic's Fable 5 model dominates across all performance tiers, achieving scores above 70% on the most demanding tasks. The data reveals distinct performance-cost tradeoffs across the 25+ models tested, from OpenAI's GPT-5.5 to Gemini 3.5 Flash to open-source alternatives like Kimi. Notably, Fable 5 High achieves 70.6% at $10.81 per task, ahead of competitors on absolute score, although cheaper models such as Gemini 3.5 Flash (49.8% at $1.94 per task) return more benchmark points per dollar.

The benchmark methodology now includes improved grading criteria for edit tasks and measures three key dimensions: accuracy scores, token consumption, and cost per task. This provides developers and enterprises with concrete data to evaluate which models offer the best performance-to-cost ratio for their specific coding workflows.

  • Smaller models like Gemini 3.5 Flash are far cheaper ($1.94/task) but lag on accuracy (49.8%), which makes them clear low-cost alternatives
  • The benchmark computes cost from published per-token pricing, so its cost comparisons carry over directly to production deployments
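
To make the performance-to-cost comparison concrete, here is a minimal sketch that works only from the figures quoted in this article. The `score_per_dollar` metric and all function and variable names are illustrative assumptions, not something CursorBench itself is known to report, and only the two models with both a score and a cost quoted here are included.

```python
# Figures as quoted in this article. Only models with BOTH a score and a
# cost quoted here are listed; nothing else is filled in.
quoted = {
    "Fable 5 High": {"score_pct": 70.6, "cost_usd": 10.81},
    "Gemini 3.5 Flash": {"score_pct": 49.8, "cost_usd": 1.94},
}

# Per-task costs quoted for the cheapest and most expensive models.
CHEAPEST_COST_USD = 0.55   # Composer 2.5
PRICIEST_COST_USD = 18.02  # Fable 5 Max


def score_per_dollar(score_pct: float, cost_usd: float) -> float:
    """Benchmark points per dollar per task (a hypothetical efficiency metric)."""
    return score_pct / cost_usd


def cost_ratio(high_usd: float, low_usd: float) -> float:
    """How many times more expensive the pricier model is per task."""
    return high_usd / low_usd


efficiency = {
    name: round(score_per_dollar(row["score_pct"], row["cost_usd"]), 2)
    for name, row in quoted.items()
}
spread = round(cost_ratio(PRICIEST_COST_USD, CHEAPEST_COST_USD), 1)

for name, value in efficiency.items():
    print(f"{name}: {value} points per dollar")
print(f"Cost spread: {spread}x")
```

On these numbers, Fable 5 High works out to roughly 6.5 points per dollar against roughly 25.7 for Gemini 3.5 Flash, and the cost spread is about 33x. A simple ratio like this ignores whether a lower score is acceptable for a given workflow, so it is a starting point for comparison, not a ranking.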

Editorial Opinion

CursorBench 3.1 provides the most comprehensive public evaluation of coding assistant models to date, moving beyond toy edit tasks to real developer workflows like bugfinding and codebase navigation. Fable 5's dominant performance is significant: it suggests Anthropic's latest model may finally offer the code understanding capabilities that early Cursor users have long demanded. However, the massive cost spread (from $0.55 to $18 per task) underscores an uncomfortable truth: no single model wins outright. Teams will need to choose based on their own requirements for latency, accuracy, and budget.

Large Language Models (LLMs) | AI Agents | Machine Learning | Data Science & Analytics
