Guess-Verify-Refine: Data-Aware Algorithm Achieves 1.88x Speedup for Sparse-Attention Decoding on Blackwell
Key Takeaways
- GVR delivers a 1.88x average and up to 2.42x peak speedup for Top-K selection on Blackwell, addressing a critical latency bottleneck
- End-to-end inference latency improves by up to 7.52% at 100K context, with larger gains at longer sequences
- The technique is validated on DeepSeek-V3.2 in the TensorRT-LLM production stack, demonstrating real-world applicability
Summary
A new optimization technique called Guess-Verify-Refine (GVR) addresses a latency bottleneck in sparse-attention decoding on NVIDIA's Blackwell architecture. The algorithm exploits temporal correlation across consecutive decode steps to accelerate the Top-K selection process, a critical operation in long-context LLM serving. GVR uses the previous step's Top-K results as a prediction signal, applies pre-indexed statistics, and narrows candidate thresholds with a secant-style counting approach before performing the final exact selection.
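The guess-verify-refine flow described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the TensorRT-LLM kernel: the guess is the previous decode step's k-th largest score used as a starting threshold, verification is a candidate count at that threshold, and refinement is a secant-style update on the count until the candidate pool is small, after which an exact selection guarantees bit-exact Top-K output.

```python
# Hedged sketch of a guess-verify-refine Top-K, NOT the production kernel.
# `guess_threshold` stands in for the previous decode step's k-th score.
import numpy as np

def gvr_topk(scores, k, guess_threshold):
    """Return the exact top-k values of `scores` (descending), seeded
    with a threshold guess from the previous decode step."""
    lo, hi = float(scores.min()), float(scores.max())
    t0, c0 = hi, int((scores >= hi).sum())         # strict end of bracket
    t1 = min(max(guess_threshold, lo), hi)         # guess: last step's cutoff
    c1 = int((scores >= t1).sum())                 # verify: count candidates
    # Refine: secant-style steps on count(t) = k shrink the candidate pool.
    for _ in range(6):
        if k <= c1 <= 4 * k:                       # pool small enough
            break
        if c1 == c0:                               # flat secant step: bisect
            t_new = (t1 + lo) / 2 if c1 < k else (t1 + hi) / 2
        else:
            t_new = t1 + (k - c1) * (t1 - t0) / (c1 - c0)
        t_new = min(max(t_new, lo), hi)
        t0, c0 = t1, c1
        t1, c1 = t_new, int((scores >= t_new).sum())
    if c1 < k:                                     # safety: never under-select
        t1 = lo
    cand = scores[scores >= t1]                    # small candidate pool
    # Final exact selection preserves bit-exact Top-K results.
    return np.sort(np.partition(cand, cand.size - k)[-k:])[::-1]
```

Because the final step is an exact selection over a pool guaranteed to contain all k winners, the output is identical to a full sort; the refinement only reduces how much data that exact step must touch, which is where a temporally stable threshold guess pays off.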
The optimization delivers substantial performance improvements: an average 1.88x speedup over production radix-select kernels, with peak improvements reaching 2.42x per layer per decode step, while preserving bit-exact Top-K outputs. The technique was validated on real DeepSeek-V3.2 workloads integrated into TensorRT-LLM, demonstrating practical applicability. In controlled minimum-latency deployments, GVR improves end-to-end time per output token (TPOT) by up to 7.52% at 100K context lengths, with larger gains observed at longer contexts.
The algorithm connects to the mathematical structure of DeepSeek Sparse Attention (DSA) indexer scores through Toeplitz and RoPE patterns. While implemented in the current TensorRT-LLM DSA stack on Blackwell, the underlying principles may extend to other sparse-attention decoders that exhibit temporal stability in their decode-phase Top-K computations, suggesting broader applicability beyond this specific use case.
Editorial Opinion
This work exemplifies how algorithmic innovation can overcome optimization plateaus when conventional approaches hit diminishing returns. Even as hardware and attention kernels approach their efficiency limits, careful exploitation of data-dependent patterns, in this case temporal correlation in decoding, can unlock meaningful performance gains. The 1.88x operator-level speedup translating into a 7.52% end-to-end latency improvement suggests sparse attention remains a viable path for efficient long-context inference, and this technique could become a standard optimization in production LLM serving stacks.