BotBeat

RESEARCH | Anthropic | 2026-05-29

CVE-Bench: New Benchmark Tests Whether AI Can Actually Fix Real-World Security Vulnerabilities

Key Takeaways

  • CVE-Bench is the first benchmark specifically designed to test AI agents on fixing real-world security vulnerabilities
  • Testing shows that fixing vulnerabilities is significantly more complex than identifying them, despite vendor claims about their models' security expertise
  • Many open-source security fixes lack proper test coverage, which limits how well AI-generated patches can be validated

Source: Hacker News (https://giovannigatti.github.io/cve-bench/)

Summary

CVE-Bench is a new research benchmark designed to test whether AI agents can fix real-world security vulnerabilities. Created by researcher logickkk1, the benchmark includes 20 CVEs from 18 diverse Python projects (including Pillow, GitPython, yt-dlp, and urllib3), testing five AI models across three prompt conditions. Each agent runs in a sandboxed container and is scored against maintainer security tests, covering vulnerabilities with CVSS scores from 2.1 to 9.8 across 15 different Common Weakness Enumeration (CWE) categories.
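
To make the scale of the setup concrete, the sketch below enumerates the evaluation grid implied by those numbers. It assumes each model is run once on every CVE under every prompt condition, which the article does not state explicitly; the CVE, model, and condition labels are placeholders, not the benchmark's actual identifiers.

```python
from itertools import product

# Placeholder labels: the article gives the counts, not the real names.
CVES = [f"cve-{i:02d}" for i in range(1, 21)]                 # 20 CVEs
MODELS = [f"model-{c}" for c in "abcde"]                      # 5 models
CONDITIONS = ["condition-1", "condition-2", "condition-3"]    # 3 prompt conditions


def run_matrix():
    """Return every (cve, model, condition) combination.

    Assumes a full grid with one run per cell (an assumption, not a
    detail confirmed by the source).
    """
    return list(product(CVES, MODELS, CONDITIONS))


runs = run_matrix()
print(len(runs))  # 20 * 5 * 3 = 300
```

Under that assumption, the benchmark amounts to 300 sandboxed agent runs, each scored independently.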

CVE-Bench was prompted by Anthropic's early 2026 claim that its Mythos model finds security vulnerabilities better than human experts, a claim made while the number of security vulnerabilities continues to rise. The benchmark evaluates several AI models, including ones from Anthropic and Poolside (Laguna), to check whether AI can actually fix the problems it claims to identify. The research finds that fixing vulnerabilities is significantly more complex than identifying them, and that many open-source projects ship security patches without proper test coverage.

CVE-Bench addresses a gap in AI evaluation. General code-fixing benchmarks such as SWE-Bench already exist, but this is described as the first systematic benchmark for security vulnerability patching that uses real-world CVEs and actual maintainer test suites. For each CVE, the benchmark provides the vulnerable and fixed git SHAs, Docker container setup scripts, and manually curated security tests that expose the vulnerability without revealing the fix, creating a reproducible environment for evaluating AI security capabilities.
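
The pieces listed above can be pictured as a per-CVE task record plus a pass/fail check. The following is a minimal sketch, not the benchmark's actual schema: the `CveTask` class, its field names, and both helper functions are hypothetical, and the rule that a patch counts only if every security test passes is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CveTask:
    """Hypothetical record for one benchmark task (field names are assumed)."""
    cve_id: str
    repo: str
    vulnerable_sha: str               # commit the agent starts from
    fixed_sha: str                    # maintainer's fix, hidden from the agent
    setup_script: str                 # builds the Docker sandbox
    security_tests: tuple[str, ...]   # curated tests that expose the bug


def exposes_vulnerability(task, passed_on_vulnerable, passed_on_fixed):
    """Check that the tests fail before the fix and pass after it."""
    tests = set(task.security_tests)
    fails_before = not tests <= set(passed_on_vulnerable)
    passes_after = tests <= set(passed_on_fixed)
    return fails_before and passes_after


def patch_resolves(task, passed_after_patch):
    """Assumed scoring rule: every security test must pass on the agent's patch."""
    return set(task.security_tests) <= set(passed_after_patch)
```

A task whose tests already pass on the vulnerable commit would say nothing about a patch, which is why a check like `exposes_vulnerability` matters when security tests are curated by hand.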

  • The benchmark tests multiple AI models across diverse Python projects with CVSS scores ranging from 2.1 to 9.8
  • Real-world benchmark results show that vulnerability fixing requires understanding domain-specific API changes and security implications

Editorial Opinion

CVE-Bench is a crucial contribution to AI safety evaluation at a critical moment. As models like Anthropic's Mythos make increasingly bold claims about security capabilities, we need rigorous real-world benchmarks to ground expectations in reality. The gap between identifying vulnerabilities and actually fixing them—especially when open-source maintainers often skip proper security testing—exposes both an AI limitation and a systemic software engineering problem. This benchmark could become essential infrastructure for the AI safety community.

Large Language Models (LLMs) | AI Agents | Cybersecurity | AI Safety & Alignment
