CVE-Bench: New Benchmark Tests Whether AI Can Actually Fix Real-World Security Vulnerabilities

Key Takeaways

▸CVE-Bench is the first benchmark specifically designed to test AI agents on fixing real-world security vulnerabilities
▸Testing shows that fixing vulnerabilities is significantly harder than identifying them, despite vendors' claims of AI security expertise
▸Many open-source security fixes lack proper test coverage, limiting validation of AI-generated patches

Source: Hacker News, https://giovannigatti.github.io/cve-bench/

Summary

CVE-Bench is a new research benchmark designed to test whether AI agents can fix real-world security vulnerabilities. Created by researcher logickkk1, the benchmark includes 20 CVEs from 18 diverse Python projects (including Pillow, GitPython, yt-dlp, and urllib3), testing five AI models across three prompt conditions. Each agent runs in a sandboxed container and is scored against maintainer security tests, covering vulnerabilities with CVSS scores from 2.1 to 9.8 across 15 different Common Weakness Enumeration (CWE) categories.

The creation of CVE-Bench was prompted by Anthropic's early 2026 claim that their Mythos model finds security vulnerabilities better than human experts, even as security vulnerabilities continue to rise. The benchmark evaluates multiple AI models, including those from Anthropic and Poolside (Laguna), to check whether AI can actually fix the kinds of problems it is claimed to identify. The research finds that fixing vulnerabilities is significantly more complex than identifying them, and that many open-source projects lack proper test coverage for their security patches.

CVE-Bench addresses a gap in AI evaluation. While general code-fixing benchmarks like SWE-Bench exist, this is the first systematic benchmark specifically for security vulnerability patching that uses real-world CVEs and actual maintainer test suites. For each CVE, the benchmark provides the vulnerable and fixed git SHAs, Docker container setup scripts, and manually curated security tests that expose the vulnerability without revealing the fix, creating a reproducible environment for evaluating AI security capabilities.
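To make the evaluation setup described above more concrete, here is a minimal Python sketch of how one benchmark entry and its scoring might be represented. It is an illustration under stated assumptions, not CVE-Bench's actual schema or harness: the class names `CveTask` and `RunResult`, all field names, and the `pass_rates` helper are hypothetical, and the identifiers in the usage note are placeholders rather than real CVEs, models, or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CveTask:
    """One benchmark entry. Field names are hypothetical, not CVE-Bench's schema."""
    cve_id: str          # placeholder identifier for the vulnerability
    project: str         # the Python project the CVE belongs to
    vulnerable_sha: str  # git commit the agent starts from
    fixed_sha: str       # maintainer's fix, hidden from the agent
    cvss: float          # severity score, 0.0 to 10.0
    cwe: str             # weakness category

    def __post_init__(self):
        # CVSS scores are defined on a 0.0-10.0 scale.
        if not 0.0 <= self.cvss <= 10.0:
            raise ValueError(f"CVSS out of range: {self.cvss}")


@dataclass(frozen=True)
class RunResult:
    """Outcome of one agent attempt on one task (hypothetical structure)."""
    task: CveTask
    model: str
    prompt_condition: str
    security_tests_passed: bool


def pass_rates(results):
    """Return {(model, prompt_condition): fraction of runs whose security tests passed}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in results:
        key = (r.model, r.prompt_condition)
        totals[key] += 1
        passed[key] += int(r.security_tests_passed)
    return {key: passed[key] / totals[key] for key in totals}
```

With two placeholder tasks and one hypothetical model, where one patch passes the security tests and one does not, `pass_rates` returns `{("model-a", "minimal"): 0.5}`. Grouping by model and prompt condition mirrors the article's description of several models tested across several prompt conditions; the real harness may aggregate differently.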

▸The benchmark tests multiple AI models across diverse Python projects with CVSS scores ranging from 2.1 to 9.8
▸Real-world benchmark results show that vulnerability fixing requires understanding domain-specific API changes and security implications

Editorial Opinion

CVE-Bench is a crucial contribution to AI safety evaluation, and it arrives at an important moment. As models like Anthropic's Mythos make increasingly bold claims about security capabilities, we need rigorous real-world benchmarks to ground expectations in reality. The gap between identifying vulnerabilities and actually fixing them, especially when open-source maintainers often skip proper security testing, exposes both an AI limitation and a systemic software engineering problem. This benchmark could become essential infrastructure for the AI safety community.

