Snyk VulnBench Study Reveals Inconsistent Repeatability in LLM Security Scanning
Key Takeaways
- ▸LLM security findings vary widely across identical scans: only 22 of 80 unique exploratory findings (27.5%) appeared in all five repetitions
- ▸Claude was stable when its findings matched known vulnerability patterns in Snyk Code: 134 of 158 unique reference-matched findings (about 85%) appeared in all five repetitions
- ▸Deterministic SAST tools remain superior for systematic, repeatable vulnerability enumeration
- ▸A hybrid approach that combines agentic LLM review with SAST tools is more effective than either alone
- ▸The research identifies a potential Snyk Code product gap where Claude found a vulnerability that SAST missed
Summary
Snyk has released VulnBench JavaScript 1.0, a research benchmark designed to measure the repeatability of LLM-based security review. The study ran 300 repeated vulnerability-finding scans on identical JavaScript code to assess how consistently large language models identify the same security bugs, with Claude as a primary test subject.
The findings reveal a stark divide in LLM reliability. When Claude's findings matched known Snyk Code reference vulnerabilities, results were highly stable: 134 of 158 unique reference-matched findings (about 85%) appeared in all five identical test repetitions. Additional findings that did not match known references were far less consistent: only 22 of 80 unique unmatched findings (27.5%) appeared in all five runs, and some surfaced in only a single run. This suggests LLMs excel at pattern-matching against known vulnerability types but struggle with consistent exploratory detection.
The research indicates that deterministic SAST (static application security testing) tools like Snyk Code remain superior for systematic enumeration of data-flow sinks, while agentic LLMs excel at recognizing familiar exploit patterns. Snyk concludes that combining both approaches yields the most effective security coverage, rather than treating either technique as a replacement for the other.
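To make the repeatability figures concrete, here is a minimal Python sketch of how "appeared in all five runs" could be tallied. It is an illustration under stated assumptions, not Snyk's actual methodology: the function names (`count_appearances`, `summarize`) and the finding identifiers are hypothetical, and the toy data is invented purely to exercise the code. Only the two ratios at the end use numbers reported in the study.

```python
from collections import Counter


def count_appearances(runs):
    """Return, for each unique finding, the number of runs it appeared in."""
    counts = Counter()
    for findings in runs:
        counts.update(set(findings))  # count each finding at most once per run
    return counts


def summarize(runs):
    """Summarize how repeatable the findings are across repeated scans."""
    counts = count_appearances(runs)
    unique = len(counts)
    in_all_runs = sum(1 for c in counts.values() if c == len(runs))
    only_once = sum(1 for c in counts.values() if c == 1)
    return {
        "unique": unique,
        "in_all_runs": in_all_runs,
        "only_once": only_once,
        "stable_share": in_all_runs / unique if unique else 0.0,
    }


# Hypothetical toy data: five repeated scans of the same code.
# The finding identifiers are invented for illustration only.
runs = [
    {"sqli:app.js:42", "xss:view.js:10", "path:fs.js:7"},
    {"sqli:app.js:42", "xss:view.js:10"},
    {"sqli:app.js:42", "xss:view.js:10", "redos:util.js:3"},
    {"sqli:app.js:42", "xss:view.js:10"},
    {"sqli:app.js:42", "xss:view.js:10", "path:fs.js:7"},
]

summary = summarize(runs)
print(summary)

# Ratios computed from the counts reported in the study.
matched_share = 134 / 158   # reference-matched findings seen in all five runs
unmatched_share = 22 / 80   # unmatched findings seen in all five runs
print(f"matched: {matched_share:.1%}, unmatched: {unmatched_share:.1%}")
```

In the toy data, two of the four unique findings appear in every run, so the stable share is 0.5. Applying the same ratio to the study's reported counts gives roughly 84.8% for reference-matched findings and 27.5% for unmatched ones.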
Editorial Opinion
This research addresses a critical blind spot in the growing adoption of LLMs for security: consistency matters. While Claude shows promise at pattern-matching against known vulnerability types, the high variance in exploratory findings raises the question of whether general-purpose models can reliably perform specialized security tasks without augmentation. Snyk's pragmatic conclusion, that LLMs and deterministic SAST are complementary rather than competitive, reflects mature thinking about AI's role in security, but security teams must understand these limitations before deploying LLM-based tools as primary scanners.



