
RESEARCH | Snyk | 2026-06-16

Snyk VulnBench Study Reveals Inconsistent Repeatability in LLM Security Scanning

Key Takeaways

  • LLM security findings vary widely across identical scans: only about 27% (22 of 80) of exploratory findings appeared in all five repetitions
  • Claude was stable when its findings matched known vulnerability patterns in Snyk Code: about 85% (134 of 158) appeared in all five repetitions
  • Deterministic SAST tools remain superior for systematic, repeatable vulnerability enumeration
Source: Hacker News, https://arxiv.org/abs/2606.15762

Summary

Snyk has released VulnBench JavaScript 1.0, a research benchmark designed to measure the repeatability of LLM-based security review. The study ran 300 repeated vulnerability-finding scans on identical JavaScript code to assess how consistently large language models identify the same security bugs, with Claude as a primary test subject.

The findings reveal a stark divide in LLM reliability. When Claude's findings matched known Snyk Code reference vulnerabilities, results were highly stable: 134 of 158 unique reference-matched findings (about 85%) appeared in all five identical test repetitions. Additional findings that did not match known references were far less consistent: only 22 of 80 unique unmatched findings (about 27%) appeared in all five runs, while others appeared in just one. This suggests LLMs excel at pattern-matching against known vulnerability types but struggle with consistent exploratory detection.
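The repeatability figures above come down to simple counting: of all unique findings seen across the repeated runs, how many show up in every run? The sketch below illustrates that calculation. The function name `repeatability`, the finding identifiers, and the toy run data are hypothetical and are not taken from the benchmark; only the counts 134/158 and 22/80 come from the study summary.

```python
from collections import Counter


def repeatability(runs):
    """Return (stable, total) for a list of runs.

    Each run is a set of finding identifiers. `total` is the number of
    unique findings seen in any run; `stable` is how many of those
    appeared in every run.
    """
    counts = Counter(finding for run in runs for finding in set(run))
    total = len(counts)
    stable = sum(1 for seen in counts.values() if seen == len(runs))
    return stable, total


# Hypothetical toy data: five repetitions of the same scan.
runs = [
    {"sqli:app.js:42", "xss:view.js:10", "path:fs.js:7"},
    {"sqli:app.js:42", "xss:view.js:10"},
    {"sqli:app.js:42", "xss:view.js:10", "ssrf:net.js:3"},
    {"sqli:app.js:42", "xss:view.js:10"},
    {"sqli:app.js:42", "xss:view.js:10", "path:fs.js:7"},
]
stable, total = repeatability(runs)
print(stable, total)  # 2 4

# The same ratio applied to the counts reported in the summary.
matched_rate = 134 / 158    # reference-matched findings
unmatched_rate = 22 / 80    # unmatched (exploratory) findings
print(f"{matched_rate:.1%}")    # 84.8%
print(f"{unmatched_rate:.1%}")  # 27.5%
```

Note that this measures only whether a finding recurs, not whether it is correct; a finding that appears once may still be a real vulnerability.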

The research demonstrates that deterministic SAST (static application security testing) tools like Snyk Code remain superior for systematic enumeration of data-flow sinks, while agentic LLMs excel at recognizing familiar exploit patterns. Snyk concludes that combining both approaches yields the most effective security coverage, rather than treating either technique as a replacement for the other.
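One way to picture the combined approach is a triage step that treats deterministic SAST output as the baseline and only escalates LLM-only findings that recur across repeated runs. The sketch below is an illustrative assumption, not Snyk's method: the `triage` function, the `min_runs` threshold, and the finding identifiers are all hypothetical.

```python
from collections import Counter


def triage(sast_findings, llm_runs, min_runs=3):
    """Split findings into three buckets (hypothetical policy).

    baseline       : everything the deterministic SAST tool reported
    review         : LLM-only findings seen in at least `min_runs` runs
    low_confidence : LLM-only findings seen in fewer runs
    """
    baseline = set(sast_findings)
    counts = Counter(f for run in llm_runs for f in set(run))
    llm_only = {f: c for f, c in counts.items() if f not in baseline}
    return {
        "baseline": sorted(baseline),
        "review": sorted(f for f, c in llm_only.items() if c >= min_runs),
        "low_confidence": sorted(f for f, c in llm_only.items() if c < min_runs),
    }


# Hypothetical toy data.
sast = {"sqli:app.js:42", "xss:view.js:10"}
llm_runs = [
    {"sqli:app.js:42", "proto:merge.js:5"},
    {"sqli:app.js:42", "proto:merge.js:5", "ssrf:net.js:3"},
    {"sqli:app.js:42", "proto:merge.js:5"},
]
result = triage(sast, llm_runs, min_runs=3)
print(result["review"])          # ['proto:merge.js:5']
print(result["low_confidence"])  # ['ssrf:net.js:3']
```

The low-confidence bucket is kept rather than discarded, since a finding that fails to repeat is not necessarily a false positive.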

  • Hybrid approach combining agentic LLM review with SAST tools is more effective than either alone
  • Research identifies a potential Snyk Code product gap where Claude found a vulnerability SAST missed

Editorial Opinion

This research addresses a critical blind spot in the growing adoption of LLMs for security: consistency matters. Claude shows promise at pattern-matching against known vulnerability types, but the high variance in exploratory findings raises the question of whether general-purpose models can reliably perform specialized security tasks without augmentation. Snyk's pragmatic conclusion, that LLMs and deterministic SAST are complementary rather than competitive, reflects mature thinking about AI's role in security. Security teams should understand these limitations before deploying LLM-based tools as primary scanners.

Large Language Models (LLMs), AI Agents, Machine Learning, Cybersecurity
