Benchmarking GitHub Copilot CLI's /security-review: Haiku 4.5 Matches Sonnet at a Third of the Cost
Key Takeaways
- Claude Haiku 4.5 matched Sonnet 4.6's 86% mean detection rate at roughly one third of the cost (3.3 versus 10 credits per 10-change sweep), suggesting that cheaper models can be competitive for security-critical code review
- GitHub Copilot CLI includes an undocumented /security-review command that supports multiple LLM backends and can run non-interactively for automated PR scanning
- Repeated runs of the same model varied noticeably (6-15% standard deviation), so a single scan is an unreliable measure; multiple passes improve detection reliability and reduce the risk of missing vulnerabilities
Summary
An independent developer benchmarked GitHub Copilot CLI's experimental /security-review feature, testing how well five frontier large language models detect code vulnerabilities. Using the deliberately vulnerable OWASP Juice Shop as a testbed, the developer created 10 code changes that reintroduced 14 catalogued vulnerabilities (including SQL injection, weak cryptography, XXE, and hardcoded credentials) and ran 200 independent security reviews across Claude Haiku 4.5, Sonnet 4.6, Opus 4.6, and OpenAI's GPT-5.4 and GPT-5.5.
A fixed Opus 4.6 grader scored every result against the ground-truth catalog so that scoring stayed consistent across models. The standout finding challenges conventional wisdom about LLM pricing: Claude Haiku 4.5 achieved an 86% mean detection rate, exactly matching Sonnet 4.6, while costing only 3.3 credits per 10-change sweep versus Sonnet's 10 credits. Sonnet produced slightly fewer false positives on average (0.8 vs 1.2), but the threefold cost difference makes Haiku the more efficient choice for bulk security review.
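The cost comparison above can be made concrete with a little arithmetic. The sketch below is illustrative only: it uses the figures quoted in this summary and assumes that the 86% detection rate can be read as the average fraction of the 14 catalogued vulnerabilities found per sweep, which the summary does not state explicitly. The function and variable names are hypothetical.

```python
# Figures are taken from the summary above. Reading "detection rate" as a
# per-vulnerability fraction is an assumption made for this illustration.
CATALOGUED_VULNS = 14   # vulnerabilities reintroduced across the 10 changes
DETECTION_RATE = 0.86   # mean detection rate reported for both models
CREDITS_PER_SWEEP = {"Haiku 4.5": 3.3, "Sonnet 4.6": 10.0}


def credits_per_detection(credits: float,
                          rate: float = DETECTION_RATE,
                          total: int = CATALOGUED_VULNS) -> float:
    """Credits spent per vulnerability expected to be found in one sweep."""
    expected_found = rate * total
    return credits / expected_found


costs = {model: credits_per_detection(c) for model, c in CREDITS_PER_SWEEP.items()}
for model, cost in costs.items():
    print(f"{model}: {cost:.2f} credits per expected detection")

ratio = costs["Sonnet 4.6"] / costs["Haiku 4.5"]
print(f"Sonnet 4.6 costs {ratio:.1f}x as much per expected detection")
```

Because both models share the same detection rate, the ratio reduces to the ratio of sweep costs; the calculation becomes more informative when the detection rates differ.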
The study also found substantial variance between independent runs of the same model (6-15% standard deviation), which means a single security scan is not a reliable proxy for a model's true detection capability. Taken together, the results point to a practical workflow: use a cheaper model for baseline security review, then optionally run a larger model to triage uncertain findings.
- A cost-efficient strategy is to use cheaper models for baseline scanning, then escalate to more expensive models only for edge cases or high-risk code
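To see why repeated passes can help given the run-to-run variance noted above, here is a minimal sketch under a strong simplifying assumption: each pass detects a given vulnerability independently with the same probability. The study does not claim this, and misses in practice are likely correlated across runs, so treat the result as an optimistic upper bound. The function name is hypothetical.

```python
def detection_after_passes(per_pass_rate: float, passes: int) -> float:
    """Chance a given vulnerability is flagged in at least one of `passes` runs.

    Assumes every pass is independent with the same per-pass rate, which is
    an idealization; real misses are probably correlated across runs.
    """
    if not 0.0 <= per_pass_rate <= 1.0:
        raise ValueError("per_pass_rate must be between 0 and 1")
    if passes < 1:
        raise ValueError("passes must be at least 1")
    # Probability of missing in every pass is (1 - p) ** passes.
    return 1.0 - (1.0 - per_pass_rate) ** passes


for k in (1, 2, 3):
    print(f"{k} pass(es): {detection_after_passes(0.86, k):.1%}")
```

Taking the union of findings across passes would also accumulate false positives, so the gain in recall has to be weighed against extra triage work and extra credits.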
Editorial Opinion
This benchmarking study arrives at a moment when the LLM market is increasingly commoditized, and the finding that cheaper models can match expensive ones challenges the reflexive 'pay for performance' assumption in security-critical applications. The methodology is robust: a controlled test set, a fixed grader, multiple runs, and a published ground truth. However, 200 reviews of a single vulnerable application provide directional evidence, not definitive proof; organizations should run their own tests on representative codebases before adopting these cost assumptions. The real value here is less about which model wins and more that rigorous benchmarking data now exists, so security teams can make evidence-based cost-performance trade-offs instead of relying on assumptions.



