DeepSeek V4 Pro and Flash Positioned Between Kimi and Claude in Independent Benchmark Test
Key Takeaways
- ▸DeepSeek V4 Pro scores 77/100 in an independent benchmark, placing it between Claude Opus 4.7 (91) and Kimi K2.6 (68), with an aggressive 75% discount available through May 31, 2026
- ▸DeepSeek V4 Flash delivers unprecedented cost efficiency, completing the benchmark for just $0.02, with output token pricing roughly 1/89th that of Claude Opus 4.7
- ▸Both DeepSeek models demonstrate strong architectural understanding but fall short in complex infrastructure scenarios, particularly lease expiry validation and database management
Summary
Independent testing of DeepSeek's newly launched V4 Pro and Flash models reveals competitive positioning in the large language model landscape. DeepSeek V4 Pro, released on April 24, 2026 under the MIT license, scored 77/100 for $2.25 in a sophisticated benchmark test, placing it between Claude Opus 4.7 (91) and Kimi K2.6 (68). DeepSeek V4 Flash, the lightweight model in the new two-tier lineup, scored 60/100 for just $0.02, offering unprecedented price-per-token value: its output tokens cost less than 1/14th as much as Kimi's and 1/89th as much as Claude Opus's. In addition, DeepSeek is offering a 75% discount on V4 Pro through May 31, 2026 and has permanently reduced input cache pricing across its lineup by 90%, significantly improving cost efficiency for enterprise use cases.
The test used a FlowGraph specification: a complex workflow orchestration backend with 20 endpoints, persistent state, lease management, retries, and event streaming, designed to evaluate models under realistic infrastructure demands rather than the lightweight tasks typical of most benchmarks. Both DeepSeek models were tested in thinking mode against the same prompt and scoring rubric used for the Claude Opus 4.7 vs. Kimi K2.6 comparison. The testing revealed that while DeepSeek V4 Pro demonstrated strong architectural understanding and a reasonable project structure, both models exhibited implementation flaws that prevented clean builds and fully passing test suites.
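To make the kind of stateful semantics such a spec exercises more concrete, the sketch below shows a minimal lease-acquisition handler in TypeScript. The `WorkflowLease` shape, the in-memory `Map` store, and the `acquireLease` function are illustrative assumptions, not the actual FlowGraph specification or any model's submission.

```typescript
// Minimal sketch of lease acquisition with expiry, the kind of stateful
// semantics a workflow orchestration benchmark exercises. All names here
// are illustrative assumptions, not the benchmark's real API.
interface WorkflowLease {
  workflowId: string;
  ownerId: string;   // worker currently holding the lease
  expiresAt: number; // epoch milliseconds at which the lease lapses
}

// In-memory store standing in for the benchmark's persistent state layer.
const leases = new Map<string, WorkflowLease>();

/** Grant a lease if none is held, the caller already holds it, or the old lease has expired. */
function acquireLease(
  workflowId: string,
  ownerId: string,
  ttlMs: number,
  now: number = Date.now()
): WorkflowLease | null {
  const existing = leases.get(workflowId);
  if (existing && existing.expiresAt > now && existing.ownerId !== ownerId) {
    return null; // another worker still holds a live lease
  }
  const lease: WorkflowLease = { workflowId, ownerId, expiresAt: now + ttlMs };
  leases.set(workflowId, lease);
  return lease;
}
```

Getting the expiry comparison and ownership check right in handlers like this is exactly where stateful-system benchmarks tend to separate models.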
DeepSeek V4 Pro passed its own test suite but encountered TypeScript build failures, while V4 Flash's test suite never executed due to database reset errors in the setup script. Detailed code review and targeted reproduction testing identified issues common to both models: lease expiry handling, scheduling logic, validation, and build integrity. These findings point to systematic challenges for models handling complex stateful systems, comparable to the issues observed with Kimi K2.6.
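As one hedged illustration of the lease expiry validation both models reportedly mishandled, a reproduction-style check might assert that an expired lease cannot be extended, even by its previous owner. The `Lease` type and `canExtendLease` function below are assumed names for illustration only, not part of the benchmark harness or either model's output.

```typescript
// Hypothetical reproduction-style check for lease expiry semantics.
// The Lease type and canExtendLease function are illustrative assumptions.
type Lease = { ownerId: string; expiresAt: number };

function canExtendLease(lease: Lease, ownerId: string, now: number = Date.now()): boolean {
  // The easy-to-miss condition: an expired lease must not be extendable,
  // even by the worker that previously held it.
  return lease.ownerId === ownerId && lease.expiresAt > now;
}

// An expired lease should be rejected rather than silently renewed.
const stale: Lease = { ownerId: "worker-a", expiresAt: Date.now() - 1_000 };
console.assert(!canExtendLease(stale, "worker-a"), "expired lease must not be extendable");
```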
- ▸Permanent 90% reduction in input cache pricing across DeepSeek's lineup improves overall cost positioning for enterprise applications
- ▸Infrastructure-level testing using FlowGraph orchestration revealed implementation gaps not visible in simpler benchmarks, demonstrating the importance of rigorous real-world scenario validation
Editorial Opinion
DeepSeek's V4 lineup represents a significant price-performance breakthrough, particularly for cost-sensitive applications through V4 Flash. However, the benchmark results suggest that competitive performance at scale requires more than architectural parity—it demands rigorous infrastructure-level validation and proper implementation of stateful system semantics. The finding that both DeepSeek models struggled with lease management and workflow orchestration highlights an often-overlooked challenge in production AI systems: models must handle not just isolated reasoning tasks, but the operational complexity of distributed systems.



