Researchers Challenge HPC Dogma: FP8 With Ozaki Scheme II Can Match FP64 Accuracy on NVIDIA's Blackwell GPUs
Key Takeaways
- ▸The NVIDIA B300's native FP64 throughput has dropped to 1.3 TFLOPS, 31x below the B200, which the authors argue makes the decades-old assumption that hardware FP64 is essential for HPC obsolete
- ▸The Ozaki Scheme II algorithm, combined with register-level fusion, lets FP8 emulate full FP64 accuracy with negligible overhead, reaching a projected ~500 TFLOPS on the B300 and exceeding H100 performance
- ▸The new Tensor-Memory Equilibrium (TME) model provides a unified framework for predicting and optimizing FP8-based HPC kernels across memory-bound and compute-bound regimes
Summary
A new arXiv paper challenges decades of conventional wisdom in high-performance computing (HPC), arguing that native 64-bit floating-point (FP64) hardware is no longer essential for scientific computing. The research demonstrates that 8-bit floating-point (FP8) precision, when paired with the Ozaki Scheme II reconstruction algorithm (based on the Chinese Remainder Theorem), can deliver full FP64 accuracy on NVIDIA's Blackwell Ultra (B300) generation and later AI-optimized GPUs.
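The Chinese Remainder Theorem idea behind this kind of reconstruction can be illustrated with a small sketch. It is not the paper's implementation: the moduli, the function names (`matmul_mod`, `crt_signed`, `crt_matmul`), and the use of plain Python integers in place of FP8 tensor-core products are all assumptions made for illustration, and the inputs are taken to be integers already (the scaling of FP64 values to integers is not shown). Each product is computed only modulo a small number, and the exact result is then recovered from the residues.

```python
from math import prod

# ASSUMPTION: illustrative pairwise-coprime moduli, not the paper's choice.
# 251 is prime, 253 = 11*23, 255 = 3*5*17, 256 = 2**8.
MODULI = [251, 253, 255, 256]


def matmul_mod(a, b, m):
    """Multiply integer matrices after reducing every entry modulo m."""
    inner = len(b)
    ar = [[x % m for x in row] for row in a]
    br = [[x % m for x in row] for row in b]
    return [
        [sum(ar[i][t] * br[t][j] for t in range(inner)) % m
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def crt_signed(residues, moduli):
    """Recover the unique integer in (-M/2, M/2] matching all residues."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    x %= big_m
    return x - big_m if x > big_m // 2 else x


def crt_matmul(a, b, moduli=MODULI):
    """Exact integer matrix product rebuilt from per-modulus products."""
    per_mod = [matmul_mod(a, b, m) for m in moduli]
    return [
        [crt_signed([c[i][j] for c in per_mod], moduli)
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


a = [[12345, -678], [9, 1011]]
b = [[-222, 3333], [4444, -5]]
result = crt_matmul(a, b)
```

The reconstruction is exact only while every entry of the true product lies within half the product of the moduli (about 2.07e9 in magnitude for the four moduli above), so covering a wider range means adding more moduli.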
The paper identifies a critical performance cliff in NVIDIA's latest architecture: the B300's native FP64 throughput has regressed to just 1.3 TFLOPS, a 31x slowdown compared to the prior B200 generation. With so little FP64 compute available, even kernels that are traditionally memory-bound (sparse matrix-vector multiplication, matrix-vector multiplication, stencil operations) become compute-bound. The researchers introduce the Tensor-Memory Equilibrium (TME) model, which augments the classical Roofline performance model with three new parameters: a compute multiplier, a bandwidth multiplier, and a reconstruction latency. Using register-level fusion techniques, they show that the overhead of FP8 emulation can be effectively hidden behind memory access latency, making the approach practical.
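A rough Roofline calculation shows why a low FP64 peak can turn a memory-bound kernel into a compute-bound one. This sketch uses only the classical Roofline formula. The three TME parameters are not modeled, because the summary does not give their formula. The 8 TB/s memory bandwidth and the 0.25 FLOP/byte intensity (two FP64 operations per 8-byte matrix element in a matrix-vector product) are assumptions chosen for illustration, and the function names are hypothetical.

```python
def attainable_tflops(peak_tflops, bandwidth_tbs, intensity):
    """Classical Roofline: min(peak compute, intensity * memory bandwidth)."""
    return min(peak_tflops, intensity * bandwidth_tbs)


def ridge_point(peak_tflops, bandwidth_tbs):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_tflops / bandwidth_tbs


B300_NATIVE_FP64_TFLOPS = 1.3   # figure quoted in the article
EMULATED_FP64_TFLOPS = 500.0    # the article's projected Ozaki II throughput
HBM_BANDWIDTH_TBS = 8.0         # ASSUMPTION: nominal memory bandwidth
MATVEC_INTENSITY = 0.25         # ASSUMPTION: 2 FLOPs per 8-byte FP64 element

native = attainable_tflops(
    B300_NATIVE_FP64_TFLOPS, HBM_BANDWIDTH_TBS, MATVEC_INTENSITY)
emulated = attainable_tflops(
    EMULATED_FP64_TFLOPS, HBM_BANDWIDTH_TBS, MATVEC_INTENSITY)

# Native FP64: the ridge point (0.1625 FLOP/byte) sits below the kernel's
# intensity, so the 1.3 TFLOPS compute ceiling is the limit.
# Emulated FP64: the bandwidth ceiling (0.25 * 8.0 = 2.0 TFLOPS) is the limit.
print(f"native:   {native} TFLOPS")
print(f"emulated: {emulated} TFLOPS")
```

Under these assumptions the native path is capped by compute at 1.3 TFLOPS, while the emulated path is capped by memory bandwidth at 2.0 TFLOPS, which is the behavior the summary describes for bandwidth-bound workloads.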
Projections indicate that Ozaki II can deliver approximately 500 TFLOPS of effective FP64-equivalent performance on the B300. In compute-bound scenarios that exceeds even the B200's native FP64 ceiling by over an order of magnitude, and in bandwidth-bound workloads it reaches the limit set by memory bandwidth. Against an H100 baseline, the Ozaki II approach matches or exceeds performance across all tested kernels, whereas B300 native FP64 imposes regressions of up to 50x. The findings challenge the assumption that scientific computing depends on specialized high-precision hardware, and they position NVIDIA's abundant FP8 tensor cores as the true performance frontier for HPC.
- ▸FP8 with Ozaki II matches or exceeds the H100 across all tested workloads, while native B300 FP64 regresses by up to 50x, signaling a paradigm shift in HPC architecture design
Editorial Opinion
This research exposes a fundamental architectural misstep: NVIDIA's decision to drastically reduce native FP64 throughput on Blackwell assumes end users will either accept the performance cliff or adopt hybrid precision techniques, and the authors make a strong case that the latter path is viable. The paper's strength lies not in blaming NVIDIA, but in showing that GPU tensor architecture has matured enough to pair abundant FP8 throughput with cheap post-processing reconstruction, rather than devoting expensive silicon to rarely used FP64 instructions. If these projections hold under real-world workload testing, this signals a tectonic shift in HPC hardware requirements and challenges GPU vendors to justify further investment in native FP64 silicon.



