Benchmarking GitHub Copilot CLI's /security-review: Haiku 4.5 Matches Sonnet at a Third of the Cost

Key Takeaways

▸Claude Haiku 4.5 matched Sonnet 4.6's 86% detection rate at just 1/3 the cost, demonstrating that cheaper models can be competitive for security-critical code review tasks
▸GitHub Copilot CLI includes an undocumented /security-review command that supports multiple LLM backends and can run non-interactively for automated PR scanning
▸Single security scans show high inter-run variance (6-15% standard deviation); multiple passes improve detection reliability and reduce the risk of missing vulnerabilities

Source:

Hacker News: https://dcairo.substack.com/p/i-spent-a-weekend-benchmarking-githubs

Summary

An independent developer benchmarked GitHub Copilot CLI's experimental /security-review feature, testing how well five frontier large language models detect code vulnerabilities. Using the deliberately vulnerable OWASP Juice Shop as a testbed, the developer created 10 code changes that reintroduce 14 catalogued vulnerabilities (SQL injection, weak crypto, XXE, hardcoded credentials, etc.) and ran 200 independent security reviews across Anthropic's Claude Haiku 4.5, Sonnet 4.6, and Opus 4.6, and OpenAI's GPT-5.4 and GPT-5.5.

A fixed Opus 4.6 grader evaluated results against the ground truth catalog to maintain consistent scoring across models. The standout finding challenges conventional wisdom about LLM pricing: Claude Haiku 4.5 achieved an 86% mean detection rate—matching Sonnet 4.6 exactly—while costing only 3.3 credits per 10-change sweep versus Sonnet's 10 credits. Though Sonnet produced slightly fewer false positives on average (0.8 vs 1.2), the threefold cost difference makes Haiku the more efficient choice for bulk security review operations.
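The cost comparison above can be worked through with a few lines of arithmetic. The sketch below uses only the figures quoted in the summary (3.3 and 10 credits per sweep, an 86% detection rate, 14 catalogued vulnerabilities). It assumes the detection rate is the fraction of those 14 vulnerabilities found in one sweep; the function names are illustrative and do not come from the study.

```python
# Figures quoted in the summary; everything else here is illustrative.
CATALOGUED_VULNS = 14  # vulnerabilities reintroduced across the 10 changes


def cost_ratio(expensive_credits: float, cheap_credits: float) -> float:
    """How many times more the expensive model costs per sweep."""
    return expensive_credits / cheap_credits


def credits_per_expected_detection(
    credits_per_sweep: float,
    detection_rate: float,
    vulns: int = CATALOGUED_VULNS,
) -> float:
    """Credits spent per vulnerability a sweep is expected to find.

    Assumption: detection_rate is the fraction of the catalogued
    vulnerabilities detected in a single sweep.
    """
    return credits_per_sweep / (detection_rate * vulns)


if __name__ == "__main__":
    haiku = credits_per_expected_detection(3.3, 0.86)
    sonnet = credits_per_expected_detection(10.0, 0.86)
    print(f"Sonnet / Haiku cost ratio: {cost_ratio(10.0, 3.3):.2f}x")
    print(f"Haiku credits per expected detection:  {haiku:.3f}")
    print(f"Sonnet credits per expected detection: {sonnet:.3f}")
```

Because both models have the same detection rate, the ratio of credits per expected detection equals the raw cost ratio of roughly three to one. This calculation ignores the difference in false positives, which adds triage time on the cheaper model.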

The study found substantial variance between independent runs of the same model (6-15% standard deviation), so a single security scan is not a reliable proxy for a model's true detection capability, and repeated passes reduce the risk of a missed vulnerability. The resulting cost-efficient workflow is to use a cheaper model for baseline scanning, then optionally escalate to a more expensive model only to triage uncertain findings, edge cases, or high-risk code.
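To illustrate why repeated passes help, here is a minimal sketch that is not taken from the study. It assumes each pass detects a given vulnerability independently with the same probability, which is optimistic: in practice some vulnerabilities are likely to be systematically harder, so real gains will be smaller. The helper names and the finding labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def prob_detected_at_least_once(p_single: float, passes: int) -> float:
    """Chance that at least one of `passes` scans flags a vulnerability.

    Assumption: passes are independent and share the same per-pass
    detection probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if passes < 1:
        raise ValueError("passes must be at least 1")
    return 1.0 - (1.0 - p_single) ** passes


def merge_findings(runs: Iterable[List[str]]) -> Dict[str, int]:
    """Union the findings of several passes.

    Returns a mapping from finding label to the number of passes that
    reported it; duplicates within one pass are counted once.
    """
    counts: Counter = Counter()
    for findings in runs:
        counts.update(set(findings))
    return dict(counts)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(prob_detected_at_least_once(0.86, k), 4))
    # Hypothetical finding labels from two passes over the same change.
    print(merge_findings([["sql-injection", "xxe"], ["sql-injection"]]))
```

Under the independence assumption, two passes at an 86% per-pass rate would catch a given vulnerability about 98% of the time. Findings reported by only one pass are natural candidates for escalation to a larger model.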

Editorial Opinion

This benchmarking study arrives at a moment when the LLM market is increasingly commoditized, and the finding that a cheaper model can match a more expensive one challenges the reflexive 'pay for performance' assumption in security-critical applications. The methodology is robust: controlled test set, fixed grader, multiple runs, and published ground truth. However, 200 reviews of a single vulnerable application provide directional evidence, not definitive proof; organizations should run their own tests on representative codebases before adopting these cost assumptions. The real value here isn't which model wins, but that rigorous benchmarking data now exists, so security teams can base cost-performance trade-offs on evidence instead of assumptions.
