BotBeat
DeepSeek | RESEARCH | 2026-02-28

DeepSeek's DualPath Architecture Doubles KV-Cache Throughput by Leveraging Idle Decode-Side Network Bandwidth

Key Takeaways

  • DualPath doubles KV-Cache loading throughput by using idle decode-side storage NICs alongside prefill-side NICs in disaggregated LLM serving architectures
  • The storage-NIC bottleneck is acute for agentic workloads, where DeepSeek reports 98.7% KV-Cache hit rates and cache-compute ratios reaching 22 GB per PFLOP for DeepSeek-V3.2
  • Hardware evolution exacerbates the problem: GPU FLOPS grew 28.8x from Ampere to Blackwell while NIC bandwidth only doubled, a 14.4x drop in the I/O-to-compute ratio
Source: Hacker News (https://mesuvash.github.io/blog/2026/dualpath/)

Summary

DeepSeek has published a detailed technical paper explaining DualPath, a novel architecture that addresses a critical bottleneck in large language model inference for agentic workloads. In prefill-decode disaggregated serving systems, where model processing is split between prefill engines (which process full prompts) and decode engines (which generate tokens), a significant performance constraint emerges: storage network interface cards (NICs) on the prefill side become saturated loading KV-Cache data from distributed storage, while decode-side NICs sit idle. This problem is particularly acute for multi-turn agentic workflows, where DeepSeek reports a 98.7% KV-Cache hit rate, meaning nearly all context from previous turns must be loaded from storage before each new interaction.
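To make the load concrete, here is a back-of-envelope sketch of per-turn cache traffic. The 98.7% hit rate comes from the article; the context length, per-token cache size, and NIC line rates are illustrative assumptions, not DeepSeek's published figures.

```python
# Illustrative estimate of per-turn KV-Cache load volume in a multi-turn
# agentic session. Only the hit rate (98.7%) is from the article; the
# context length and per-token cache size are hypothetical placeholders.

def kv_load_bytes(context_tokens: int, bytes_per_token: int, hit_rate: float) -> int:
    """Bytes fetched from distributed storage for one turn: the cached
    fraction of the context is loaded rather than recomputed."""
    return int(context_tokens * bytes_per_token * hit_rate)

def load_seconds(nbytes: int, nic_gbps: float) -> float:
    """Transfer time over a NIC of the given line rate (gigabits/s)."""
    return nbytes / (nic_gbps * 1e9 / 8)

cache = kv_load_bytes(context_tokens=128_000, bytes_per_token=70_000, hit_rate=0.987)
single = load_seconds(cache, nic_gbps=400)   # prefill-side storage NIC only
dual = load_seconds(cache, nic_gbps=800)     # DualPath: decode-side NIC added
print(f"{cache / 1e9:.1f} GB cached context, {single:.2f}s -> {dual:.2f}s")
```

Under these assumptions each new turn must pull roughly 9 GB of cached context before prefill can begin, and doubling the usable bandwidth halves that stall.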

The DualPath solution exploits this imbalance by routing KV-Cache traffic through both prefill-side storage NICs and previously idle decode-side NICs, effectively doubling the available bandwidth for cache loading. The architecture introduces a CNIC-Centric Traffic Manager that coordinates traffic across compute NICs (CNICs) using RDMA, and an Adaptive Request Scheduler that intelligently decides whether to use the traditional storage path, the new dual-path approach, or pure CNIC transfer based on cache size and network conditions. DeepSeek's implementation addresses the fundamental hardware trend problem: from NVIDIA Ampere to Blackwell generations, GPU compute performance increased 28.8x while NIC bandwidth only grew 2.0x, creating a 14.4x drop in the I/O-to-compute ratio.
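The scheduler's three-way choice can be sketched as follows. The thresholds, utilization signals, and path names are hypothetical; DeepSeek has not published the actual decision logic, only that it weighs cache size and network conditions.

```python
# A minimal sketch of the path selection described above. The decision
# thresholds and input signals are invented for illustration; the real
# Adaptive Request Scheduler's logic is not public.

from enum import Enum

class Path(Enum):
    STORAGE = "prefill storage NIC only"    # traditional single path
    DUAL = "storage NIC + decode-side NIC"  # DualPath
    CNIC = "pure compute-NIC transfer"      # RDMA between engines

def choose_path(cache_gb: float, storage_nic_util: float,
                decode_nic_util: float, cache_on_decode_peer: bool) -> Path:
    """Pick a KV-Cache transfer path from cache size and NIC load."""
    if cache_on_decode_peer and cache_gb < 1.0:
        # Small caches already resident on a peer engine: direct RDMA
        # over compute NICs avoids touching storage entirely.
        return Path.CNIC
    if storage_nic_util > 0.8 and decode_nic_util < 0.5:
        # Storage path saturated while decode-side NICs sit idle:
        # split the transfer across both paths.
        return Path.DUAL
    return Path.STORAGE
```

The key design point survives the simplification: the dual path is only worth its coordination overhead when the storage path is the bottleneck and the decode side has slack.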

The paper demonstrates that for DeepSeek-V3.2 workloads with high cache hit rates, the cache-compute ratio reaches approximately 22 GB per petaFLOP, meaning GPUs spend more time waiting for data than computing. By activating decode-side network resources, DualPath effectively transforms a storage bandwidth problem into a network orchestration challenge, leveraging existing hardware more efficiently rather than requiring infrastructure upgrades. The approach is particularly relevant as AI workloads increasingly shift toward agentic, multi-turn interactions where context reuse is paramount.
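The arithmetic behind these ratios checks out directly. In the sketch below, the 28.8x, 2.0x, and 22 GB/PFLOP figures are from the article; the GPU throughput and NIC line rate are illustrative placeholders, not DeepSeek's hardware figures.

```python
# Back-of-envelope check of the hardware-trend numbers quoted above:
# GPU FLOPS grew 28.8x from Ampere to Blackwell while NIC bandwidth
# doubled, so I/O bytes available per FLOP fell by 28.8 / 2.0 = 14.4x.

flops_growth = 28.8
nic_growth = 2.0
io_per_flop_drop = flops_growth / nic_growth
print(f"I/O-to-compute ratio dropped {io_per_flop_drop:.1f}x")

# With the article's 22 GB of KV-Cache per PFLOP of compute, compare
# load time against compute time. The GPU throughput and NIC rate are
# hypothetical round numbers chosen for illustration.
cache_gb_per_pflop = 22.0
gpu_pflops = 2.5          # assumed sustained throughput, PFLOP/s
nic_gb_per_s = 50.0       # one 400 Gb/s storage NIC

compute_s_per_pflop = 1.0 / gpu_pflops
load_s_per_pflop = cache_gb_per_pflop / nic_gb_per_s
print(f"compute {compute_s_per_pflop:.2f}s vs load {load_s_per_pflop:.2f}s per PFLOP")
```

Whenever the load time exceeds the compute time, as it does here, the GPU idles waiting on the NIC, which is exactly the regime where adding a second network path pays off.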

  • The architecture introduces a CNIC-Centric Traffic Manager and Adaptive Request Scheduler that intelligently route traffic based on cache size and network conditions
  • DualPath represents a software-based solution to a hardware scaling mismatch, extracting more value from existing infrastructure rather than requiring costly network upgrades

Editorial Opinion

DualPath exemplifies the kind of systems-level innovation that will define the next phase of AI infrastructure optimization. As the industry races to scale inference for production workloads, the bottleneck has shifted from raw compute to the orchestration of data movement—a problem that won't be solved by simply buying faster GPUs. DeepSeek's insight that half the cluster's storage bandwidth sits idle in conventional architectures is both obvious in hindsight and profoundly important. This work suggests that the next frontier in AI systems isn't just about training bigger models or running faster chips, but about ruthlessly eliminating inefficiencies in how existing hardware resources are utilized across increasingly complex, multi-stage inference pipelines.

Large Language Models (LLMs) · AI Agents · MLOps & Infrastructure · Science & Research · Market Trends

