DeepSeek Leads in Security Exploit Challenge Across LLM Providers
Key Takeaways
- ▸DeepSeek V4 Pro achieved the highest success rate (3/10) in identifying Firebase-based vulnerabilities, while DeepSeek Flash, Gemini, and Step models achieved 0/10
- ▸Claude models (Sonnet and Opus) showed strong technical understanding but were consistently halted by security guardrails, suggesting effective safety training
- ▸Google Gemini models refused immediately, limiting any exploration of exploitation vectors
Summary
Security researcher Kasra conducted a comparative analysis of large language models' ability to identify and exploit a vulnerability in a deliberately vulnerable React Native app. Spending $1,500 across multiple runs, the researcher tested nine LLM variants and found significant variance in performance: DeepSeek V4 Pro achieved the best results with a 3/10 success rate, while Claude, Gemini, and other models showed limited exploitation capabilities with security guardrails frequently halting attempts.
The vulnerability tested was a common real-world pattern: hardened API security paired with exposed Firebase credentials in the app binary, allowing direct unauthorized database access. Most models that attempted exploitation quickly identified the Firebase attack surface as the primary path. However, the models behaved inconsistently: DeepSeek V4 Pro sometimes got distracted by API and app vectors, Gemini models refused all attempts citing security concerns, and Claude Opus frequently hit safety guardrails when it was close to the solution.
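For developers who want to check their own builds for this pattern, a minimal sketch of a string scan over extracted bundle text might look like the following. It is not the researcher's method; the function name `find_firebase_config`, the sample string, and both regular expressions are illustrative assumptions (the `AIza` prefix and the `firebaseio.com` host are common conventions, not a complete specification).

```python
import re

# Assumed patterns (illustrative, not exhaustive):
# Google-style API keys commonly start with "AIza" plus 35 URL-safe characters;
# Realtime Database URLs commonly end in ".firebaseio.com".
API_KEY_RE = re.compile(r"AIza[0-9A-Za-z_\-]{35}")
DB_URL_RE = re.compile(r"https://[a-z0-9.\-]+\.firebaseio\.com")


def find_firebase_config(text: str) -> dict:
    """Return Firebase-looking identifiers found in text extracted from an app bundle."""
    return {
        "api_keys": sorted(set(API_KEY_RE.findall(text))),
        "database_urls": sorted(set(DB_URL_RE.findall(text))),
    }


# Hypothetical bundle snippet with a dummy key (35 "A" characters after the prefix).
sample = (
    'const cfg = {apiKey: "AIza' + "A" * 35 + '", '
    'databaseURL: "https://demo-app.firebaseio.com"};'
)
print(find_firebase_config(sample))
```

Finding such values in a client app is expected for Firebase projects; the exposure described here depends on the backend accepting unauthenticated access with them, so a hit is a prompt to review access rules rather than proof of a flaw.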
The research highlighted a pattern across the industry: current LLMs show limited capability for systematic security exploitation, with success heavily influenced by model architecture, safety training, and cost constraints. Claude models demonstrated particular caution around exploitation tasks, with Opus stopping runs due to security considerations despite being on the right technical path.
- Cost per successful exploit varied dramatically: $333/solve for DeepSeek V4 Pro vs. $900+/solve for Claude models that solved the challenge
- Security approaches vary significantly across providers: some refuse entirely (Gemini), some implement late-stage guardrails (Claude, Gemini Flash), while others show less constraint (DeepSeek)
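The cost-per-solve figures above are a simple ratio of spend to successful runs. A minimal sketch of that arithmetic follows, assuming a per-model spend of about $1,000 for DeepSeek V4 Pro; that spend is back-computed from the reported $333/solve and 3 solves and is not stated in the source. The function name `cost_per_solve` is hypothetical.

```python
from typing import Optional


def cost_per_solve(total_spend_usd: float, solves: int) -> Optional[float]:
    """Return spend divided by successful runs, or None when nothing was solved."""
    if solves == 0:
        return None  # the ratio is undefined for models that scored 0/10
    return total_spend_usd / solves


# Assumed spend: ~$1,000, back-computed from $333/solve x 3 solves (not from the source).
deepseek_v4_pro = cost_per_solve(1000.0, 3)
print(round(deepseek_v4_pro))  # 333

# Models with zero solves have no meaningful cost-per-solve figure.
print(cost_per_solve(1000.0, 0))  # None
```

Returning `None` for zero solves keeps 0/10 models out of any ranking by cost instead of reporting an infinite or misleading value.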
Editorial Opinion
This benchmark reveals an uncomfortable truth: LLMs currently show limited systematic capability for security exploitation, but the variance is striking. DeepSeek's relative success, set against Claude's safety interventions, which halted runs that were on the right technical path, suggests that guardrails work, but at a cost to capability. For security research and penetration testing, this implies LLMs remain immature tools that require significant human judgment and supervision. The real story isn't that LLMs can hack apps (they can't, reliably), but that safety implementations vary wildly across providers, with implications for how organizations should evaluate LLM trustworthiness in sensitive contexts.


