BotBeat

Research · DeepSeek · 2026-06-04

DeepSeek Leads in Security Exploit Challenge Across LLM Providers

Key Takeaways

  • DeepSeek V4 Pro achieved the highest success rate (3/10) at identifying and exploiting the Firebase-based vulnerability, while DeepSeek Flash, Gemini, and Step models scored 0/10
  • Claude models (Sonnet and Opus) showed strong technical understanding but were frequently halted by security guardrails, suggesting effective safety training
  • Google Gemini models often refused immediately, which limited their exploration of exploitation vectors
Source: Hacker News, https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-hack-my-app/

Summary

Security researcher Kasra compared how well large language models could identify and exploit a vulnerability in a deliberately vulnerable React Native app. Spending $1,500 across multiple runs, the researcher tested nine LLM variants and found wide variance in performance: DeepSeek V4 Pro achieved the best result with a 3/10 success rate, while Claude, Gemini, and other models showed limited exploitation capability, with security guardrails frequently halting their attempts.

The vulnerability tested follows a common real-world pattern: a hardened API paired with Firebase credentials exposed in the app binary, which allows direct, unauthorized access to the database. Most models that attempted exploitation quickly identified the Firebase attack surface as the primary path. Their behavior was inconsistent, however: DeepSeek V4 Pro sometimes got distracted by API and app-level vectors, Gemini models refused all attempts citing security concerns, and Claude Opus frequently hit safety guardrails when it was close to a solution.
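To make that attack surface concrete: a Firebase config (API key, database URL) compiled into a React Native JavaScript bundle can be read by anyone who unpacks the app. The sketch below is a hypothetical audit helper, not the researcher's tooling, that scans extracted bundle text for common Firebase indicators; the pattern names and the function `find_firebase_indicators` are assumptions made for illustration. Finding these values is not by itself a breach, since Firebase config values are designed to ship to clients; the exposure becomes exploitable when the database's Security Rules allow unauthenticated reads or writes.

```python
import re

# Hypothetical indicator set: common Firebase markers, not the exact ones
# used in the original write-up.
FIREBASE_PATTERNS = {
    # Google API keys: "AIza" followed by 35 URL-safe characters.
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    # Realtime Database URLs, both legacy (*.firebaseio.com) and regional
    # (*.<region>.firebasedatabase.app) forms.
    "realtime_db_url": re.compile(
        r"https://[a-z0-9\-]+(?:\.[a-z0-9\-]+)?\.(?:firebaseio\.com|firebasedatabase\.app)"
    ),
    # Firebase Hosting / auth domains.
    "firebase_app_domain": re.compile(r"[a-z0-9\-]+\.firebaseapp\.com"),
}


def find_firebase_indicators(bundle_text: str) -> dict:
    """Return the distinct matches for each pattern, in first-seen order.

    Patterns with no matches are omitted from the result.
    """
    hits = {}
    for name, pattern in FIREBASE_PATTERNS.items():
        # dict.fromkeys de-duplicates while preserving order.
        matches = list(dict.fromkeys(pattern.findall(bundle_text)))
        if matches:
            hits[name] = matches
    return hits


if __name__ == "__main__":
    # Synthetic bundle fragment with a fake key; no real project is referenced.
    sample = (
        'const cfg = {apiKey: "AIza' + "A" * 35 + '", '
        'databaseURL: "https://demo-app.firebaseio.com"};'
    )
    print(find_firebase_indicators(sample))
```

A hit only says where to look next; whether the database is actually open depends on its rules, which is why the tested app's API hardening did not help once the Firebase path was found.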

The results point to a broader pattern: current LLMs show limited capability for systematic security exploitation, with success heavily influenced by model architecture, safety training, and cost constraints. Claude models were notably cautious around exploitation tasks, with Opus stopping runs on security grounds despite being on the right technical path.

  • Cost per successful exploit varied dramatically: $333/solve for DeepSeek V4 Pro vs. $900+/solve for Claude models that solved the challenge
  • Safety approaches vary significantly across providers: some refuse entirely (Gemini), some apply late-stage guardrails (Claude, Gemini Flash), and others show fewer constraints (DeepSeek)
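Cost per solve, as used above, can be read as total spend on a model divided by its successful runs; the article does not spell out its exact definition, so treat that reading as an assumption. The minimal sketch below uses a hypothetical `ModelResult` record and made-up spend figures (not the article's data) to show why the metric is undefined, rather than merely large, for models that scored 0/10.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResult:
    """Aggregate outcome for one model across repeated attempts (hypothetical structure)."""
    model: str
    runs: int
    solves: int
    total_cost_usd: float

    @property
    def success_rate(self) -> float:
        # Fraction of runs that reached the database, e.g. 3/10 -> 0.3.
        return self.solves / self.runs if self.runs else 0.0

    @property
    def cost_per_solve(self) -> Optional[float]:
        # Total spend divided by successful runs. Undefined (None) at 0 solves,
        # because the money was spent without producing any exploit.
        return self.total_cost_usd / self.solves if self.solves else None


if __name__ == "__main__":
    # Placeholder spend figures for illustration only; not taken from the article.
    results = [
        ModelResult("model-a", runs=10, solves=3, total_cost_usd=600.0),
        ModelResult("model-b", runs=10, solves=0, total_cost_usd=150.0),
    ]
    for r in results:
        cps = "n/a" if r.cost_per_solve is None else f"${r.cost_per_solve:.0f}"
        print(f"{r.model}: {r.solves}/{r.runs} solved ({r.success_rate:.0%}), cost per solve: {cps}")
```

With these placeholder numbers it prints `model-a: 3/10 solved (30%), cost per solve: $200` and `model-b: 0/10 solved (0%), cost per solve: n/a`. Returning `None` instead of infinity keeps cheap-but-unsuccessful models from being ranked as if they were merely expensive.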

Editorial Opinion

This benchmark reveals an uncomfortable truth: LLMs currently show limited systematic capability for security exploitation, but the variance between them is striking. DeepSeek's relative success, set against Claude's safety interventions, which sometimes hindered progress but ultimately prevented harm, suggests that guardrails work, though at a cost to capability. For security research and penetration testing, this implies LLMs remain immature tools that require significant human judgment and supervision. The real story isn't that LLMs can hack apps (they can't, reliably), but that safety implementations vary wildly across providers, with implications for how organizations should evaluate LLM trustworthiness in sensitive contexts.

Tags: Machine Learning · Cybersecurity · Market Trends · AI Safety & Alignment

© 2026 BotBeat