DeepSeek V4 Pro and Flash Positioned Between Kimi and Claude in Independent Benchmark Test
Key Takeaways
- ▸DeepSeek V4 Pro scores 77/100 in an independent benchmark, placing it between Claude Opus 4.7 (91) and Kimi K2.6 (68), with an aggressive 75% discount available through May 31, 2026
- ▸DeepSeek V4 Flash delivers unprecedented cost efficiency, completing the benchmark for just $0.02, with output token pricing roughly 1/89th that of Claude Opus 4.7
- ▸Both DeepSeek models demonstrate strong architectural understanding but fall short in complex infrastructure scenarios, particularly lease expiry validation and database management
Summary
Independent testing of DeepSeek's newly launched V4 Pro and Flash models reveals competitive positioning in the large language model landscape. DeepSeek V4 Pro, released on April 24, 2026 under the MIT license, scored 77/100 for $2.25 in a sophisticated benchmark test, placing it between Claude Opus 4.7 (91) and Kimi K2.6 (68). DeepSeek V4 Flash, the lightweight model in the new two-tier lineup, scored 60/100 for just $0.02, offering unprecedented price-per-token value: its output tokens cost less than 1/14th as much as Kimi's and 1/89th as much as Claude Opus's. In addition, DeepSeek is offering a 75% discount on V4 Pro through May 31, 2026 and has permanently reduced input cache pricing across its lineup by 90%, significantly improving cost efficiency for enterprise use cases.
The test used a FlowGraph specification: a complex workflow orchestration backend with 20 endpoints, persistent state, lease management, retries, and event streaming, designed to evaluate models under realistic infrastructure demands rather than the lightweight tasks typical of most benchmarks. Both DeepSeek models were tested in thinking mode against the same prompt and scoring rubric used for the Claude Opus 4.7 vs. Kimi K2.6 comparison. The testing revealed that while DeepSeek V4 Pro demonstrated strong architectural understanding and a reasonable project structure, both models exhibited implementation flaws that prevented clean builds and fully passing test suites.
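To make the kind of stateful semantics such a spec exercises more concrete, the sketch below shows a minimal lease-acquisition handler in TypeScript. The `WorkflowLease` shape, the in-memory `Map` store, and the `acquireLease` function are illustrative assumptions, not the actual FlowGraph specification or any model's submission.

```typescript
// Minimal sketch of lease acquisition with expiry, the kind of stateful
// semantics a workflow orchestration benchmark exercises. All names here
// are illustrative assumptions, not the benchmark's real API.
interface WorkflowLease {
  workflowId: string;
  ownerId: string;   // worker currently holding the lease
  expiresAt: number; // epoch milliseconds at which the lease lapses
}

// In-memory store standing in for the benchmark's persistent state layer.
const leases = new Map<string, WorkflowLease>();

/** Grant a lease if none is held, the caller already holds it, or the old lease has expired. */
function acquireLease(
  workflowId: string,
  ownerId: string,
  ttlMs: number,
  now: number = Date.now()
): WorkflowLease | null {
  const existing = leases.get(workflowId);
  if (existing && existing.expiresAt > now && existing.ownerId !== ownerId) {
    return null; // another worker still holds a live lease
  }
  const lease: WorkflowLease = { workflowId, ownerId, expiresAt: now + ttlMs };
  leases.set(workflowId, lease);
  return lease;
}
```

Getting the expiry comparison and ownership check right in handlers like this is exactly where stateful-system benchmarks tend to separate models.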
DeepSeek V4 Pro passed its own test suite but encountered TypeScript build failures, while V4 Flash's test suite never executed due to database reset errors in the setup script. Detailed code review and targeted reproduction testing identified issues common to both models: lease expiry handling, scheduling logic, validation, and build integrity. These findings point to systematic challenges for models handling complex stateful systems, comparable to the issues observed with Kimi K2.6.
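As one hedged illustration of the lease expiry validation both models reportedly mishandled, a reproduction-style check might assert that an expired lease cannot be extended, even by its previous owner. The `Lease` type and `canExtendLease` function below are assumed names for illustration only, not part of the benchmark harness or either model's output.

```typescript
// Hypothetical reproduction-style check for lease expiry semantics.
// The Lease type and canExtendLease function are illustrative assumptions.
type Lease = { ownerId: string; expiresAt: number };

function canExtendLease(lease: Lease, ownerId: string, now: number = Date.now()): boolean {
  // The easy-to-miss condition: an expired lease must not be extendable,
  // even by the worker that previously held it.
  return lease.ownerId === ownerId && lease.expiresAt > now;
}

// An expired lease should be rejected rather than silently renewed.
const stale: Lease = { ownerId: "worker-a", expiresAt: Date.now() - 1_000 };
console.assert(!canExtendLease(stale, "worker-a"), "expired lease must not be extendable");
```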
- ▸Permanent 90% reduction in input cache pricing across DeepSeek's lineup improves overall cost positioning for enterprise applications
- ▸Infrastructure-level testing using FlowGraph orchestration revealed implementation gaps not visible in simpler benchmarks, demonstrating the importance of rigorous real-world scenario validation
Editorial Opinion
DeepSeek's V4 lineup represents a significant price-performance breakthrough, particularly for cost-sensitive applications through V4 Flash. However, the benchmark results suggest that competitive performance at scale requires more than architectural parity—it demands rigorous infrastructure-level validation and proper implementation of stateful system semantics. The finding that both DeepSeek models struggled with lease management and workflow orchestration highlights an often-overlooked challenge in production AI systems: models must handle not just isolated reasoning tasks, but the operational complexity of distributed systems.



