Comprehensive Pentesting Benchmark: 18 LLMs Tested with the Agentic AI Tool Strix
Key Takeaways
- First comprehensive practical benchmark comparing 18 LLMs specifically on autonomous pentesting tool performance, moving beyond theoretical evaluations
- Testing methodology accounts for real-world constraints, including API provider limitations, pricing, and rate limits, rather than idealized scenarios
- Results include detailed metrics on both vulnerability discovery effectiveness and cost per test, enabling security teams to make informed model selection decisions
Summary
An independent security researcher conducted an extensive 100-hour evaluation of Strix, an autonomous AI pentesting tool, across 18 large language models to determine which perform best at autonomous vulnerability discovery. The researcher developed a rigorous testing methodology using a controlled lab environment containing two web applications seeded with 14 intentional vulnerabilities (CVSS scores ranging from 4.8 to 9.9, for a maximum combined score of 105.2). Tests measured each model's ability to identify distinct vulnerabilities through black-box penetration testing, with results averaged across three runs per model. The study recorded not only vulnerability discovery rates but also execution costs and token consumption across major API providers, including OpenAI, Google Vertex, and OpenRouter.
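To make the scoring arithmetic concrete, here is a minimal Python sketch of how a CVSS-weighted result could be computed under the described methodology. The vulnerability IDs and individual CVSS values are hypothetical placeholders: the study reports only the range (4.8 to 9.9) and the 105.2-point total, not the per-vulnerability values.

```python
# Hypothetical sketch of the scoring described above. Each run records which
# of the 14 seeded vulnerabilities a model found; a run's score is the sum of
# the CVSS values of its distinct findings, and the model's result is the mean
# over three runs, normalized against the 105.2-point maximum.

# Illustrative placeholders in the reported 4.8-9.9 range (14 entries in the
# real lab; the actual values were not published).
CVSS = {"vuln-01": 9.9, "vuln-02": 8.6, "vuln-03": 4.8}
MAX_SCORE = 105.2  # sum of all 14 CVSS values, per the study

def run_score(found: set[str]) -> float:
    """CVSS-weighted score for one run: sum over distinct findings."""
    return sum(CVSS[v] for v in found if v in CVSS)

def model_score(runs: list[set[str]]) -> float:
    """Average score across runs, as a fraction of the maximum."""
    return sum(run_score(r) for r in runs) / len(runs) / MAX_SCORE

# Example: two runs find two vulns each, the third finds one.
runs = [{"vuln-01", "vuln-02"}, {"vuln-01", "vuln-02"}, {"vuln-03"}]
print(f"normalized score: {model_score(runs):.3f}")
```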
The research addresses a significant gap in practical benchmarking for agentic AI systems, as most existing LLM evaluations focus on traditional benchmarks rather than real-world security tool performance. By establishing a reproducible methodology that accounts for actual deployment constraints such as rate limits and pricing tiers, the researcher provides actionable insights for security teams considering AI-assisted pentesting tools. The findings offer valuable comparative data on which models deliver the best balance of vulnerability discovery, cost efficiency, and practical usability in autonomous security testing, and they show that autonomous pentesting performance varies widely across models, with significant implications for choosing the right model for security operations.
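Because the study reports both token consumption and execution cost per run, cost per test follows directly from provider pricing. Below is a minimal sketch assuming the standard per-million-token pricing model; the rates are hypothetical placeholders, not those of any specific provider or model in the benchmark.

```python
# Minimal cost-per-test sketch under the usual per-million-token API pricing.
# The prices below are assumed for illustration only.

PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD per 1M tokens (assumed)

def test_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one pentest run, from its token consumption."""
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

# Example: a run that consumed 4.2M input tokens and 350k output tokens.
print(f"cost per test: ${test_cost(4_200_000, 350_000):.2f}")
```

Agentic runs are typically input-heavy, since the growing conversation history is resent on every tool call, so input pricing and rate limits tend to dominate the per-test cost.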
Editorial Opinion
This independent research fills a critical void in practical AI security tooling evaluation. While many LLM benchmarks focus on academic tasks, real-world assessments of agentic systems performing actual security work are rare and valuable. The methodology, which accounts for API constraints, costs, and practical deployability, sets a useful standard for future benchmarking. However, the lab-specific nature of the results highlights an important limitation: findings may not generalize across different target types, complexity levels, or adversarial scenarios, which suggests the need for broader, industry-wide standardized testing frameworks.