DeepSeek Leads in Security Exploit Challenge Across LLM Providers

Key Takeaways

▸DeepSeek V4 Pro achieved the highest success rate (3/10) in identifying the Firebase-based vulnerability, while DeepSeek Flash, Gemini, and Step models scored 0/10
▸Claude models (Sonnet and Opus) showed strong technical understanding but were frequently halted by security guardrails, suggesting effective safety training
▸Google Gemini models tended to refuse immediately, which limited any exploration of exploitation vectors

Source: Hacker News, https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-hack-my-app/

Summary

Security researcher Kasra conducted a comparative analysis of large language models' ability to identify and exploit a vulnerability in a deliberately vulnerable React Native app. Spending $1,500 across multiple runs, the researcher tested nine LLM variants and found significant variance in performance: DeepSeek V4 Pro achieved the best results with a 3/10 success rate, while Claude, Gemini, and other models showed limited exploitation capabilities with security guardrails frequently halting attempts.

The vulnerability tested was a common real-world pattern: hardened API security paired with exposed Firebase credentials in the app binary, allowing direct unauthorized database access. Most models that attempted exploitation quickly identified the Firebase attack surface as the primary path. However, the models behaved inconsistently: DeepSeek V4 Pro sometimes got distracted by API and app vectors, Gemini models refused all attempts citing security concerns, and Claude Opus frequently hit safety guardrails when it was close to the solution.
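From the defender's side, the pattern described above can be checked for before release. The following is a minimal sketch, not the researcher's method: it scans text extracted from an app bundle for strings that look like Firebase client configuration. The function name `find_firebase_config`, the two regular expressions (common secret-scanner heuristics, not an official or exhaustive list), and the sample bundle contents are all assumptions made for illustration.

```python
import re

# Heuristic patterns commonly used by secret scanners (assumed here, not
# taken from the article): Google-style API keys and Firebase database URLs.
PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "firebase_db_url": re.compile(
        r"https://[a-z0-9.-]+\.(?:firebaseio\.com|firebasedatabase\.app)"
    ),
}


def find_firebase_config(text: str) -> dict[str, list[str]]:
    """Return the distinct matches for each pattern, in first-seen order."""
    hits: dict[str, list[str]] = {}
    for name, pattern in PATTERNS.items():
        seen: list[str] = []
        for match in pattern.findall(text):
            if match not in seen:
                seen.append(match)
        hits[name] = seen
    return hits


# Hypothetical bundle contents, invented for this example.
sample_bundle = (
    'const cfg={apiKey:"AIza' + "A" * 35 + '",'
    'databaseURL:"https://demo-app-default-rtdb.firebaseio.com"};'
)
report = find_firebase_config(sample_bundle)
for name, values in report.items():
    print(name, values)
```

Finding such strings in a shipped binary is expected for most Firebase client apps, so a hit is not a vulnerability by itself. It is a prompt to confirm that the database's access rules, rather than the secrecy of the configuration, are what restrict reads and writes.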

The research highlighted a pattern across the industry: current LLMs show limited capability for systematic security exploitation, with success heavily influenced by model architecture, safety training, and cost constraints. Claude models demonstrated particular caution around exploitation tasks, with Opus stopping runs due to security considerations despite being on the right technical path.

▸Cost per successful exploit varied dramatically: $333/solve for DeepSeek V4 Pro vs. $900+/solve for Claude models that solved the challenge
▸Security approach varies significantly across providers: some refuse entirely (Gemini), some implement late-stage guardrails (Claude, Gemini Flash), while others show less constraint (DeepSeek)
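The success-rate and cost-per-solve figures above are simple ratios, sketched here for readers who want to reproduce the comparison on their own runs. The function names are hypothetical, and the $999 spend is an assumption back-computed from $333 × 3 solves; the summary does not report a per-model spend.

```python
from typing import Optional


def success_rate(solves: int, runs: int) -> float:
    """Fraction of runs that ended in a successful exploit."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return solves / runs


def cost_per_solve(total_spend_usd: float, solves: int) -> Optional[float]:
    """Total spend divided by solves; None when nothing was solved."""
    if solves == 0:
        return None  # cost per solve is undefined for a 0/10 model
    return total_spend_usd / solves


# DeepSeek V4 Pro: 3 solves in 10 runs (from the summary).
rate = success_rate(3, 10)

# Assumed spend of $999, chosen only so the result matches the quoted $333.
per_solve = cost_per_solve(999.0, 3)

# A model that never solved the challenge has no meaningful cost per solve.
unsolved = cost_per_solve(100.0, 0)

print(rate, per_solve, unsolved)
```

Returning `None` for models with zero solves keeps them out of any cost ranking instead of treating their cost as zero or infinite.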

Editorial Opinion

This benchmark reveals an uncomfortable truth: LLMs currently show limited systematic capability for security exploitation, but the variance between them is striking. DeepSeek's relative success, set against Claude's safety interventions that sometimes stopped runs on the right technical path but ultimately prevented exploitation, suggests guardrails work, though at a cost to capability. For security research and penetration testing, this implies LLMs remain immature tools that require significant human judgment and supervision. The real story isn't that LLMs can hack apps (they can't, reliably), but that safety implementations vary wildly across providers, which matters for how organizations evaluate LLM trustworthiness in sensitive contexts.
