BotBeat
NVIDIA | RESEARCH | 2026-06-08

Researchers Challenge HPC Dogma: FP8 With Ozaki Scheme II Can Match FP64 Accuracy on NVIDIA's Blackwell GPUs

Key Takeaways

  • The NVIDIA B300's native FP64 throughput falls to 1.3 TFLOPS, 31x lower than the B200's; the paper argues this makes the decades-old assumption that hardware FP64 is essential for HPC obsolete
  • The Ozaki Scheme II algorithm, combined with register-level fusion, lets FP8 emulate full FP64 accuracy with negligible overhead, with a projected ~500 TFLOPS of FP64-equivalent throughput on the B300 that matches or exceeds H100 performance
  • The new Tensor-Memory Equilibrium (TME) model provides a unified framework for predicting and optimizing FP8-based HPC kernels across memory-bound and compute-bound regimes
Source: Hacker News, https://arxiv.org/abs/2606.06510

Summary

A new arXiv paper challenges decades of conventional wisdom in high-performance computing (HPC), arguing that native 64-bit floating-point (FP64) hardware is no longer essential for scientific computing. The research demonstrates that 8-bit floating-point (FP8) precision, when paired with the Ozaki Scheme II reconstruction algorithm (based on the Chinese Remainder Theorem), can deliver full FP64 accuracy on NVIDIA's Blackwell Ultra (B300) generation and later AI-optimized GPUs.
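The Chinese Remainder Theorem idea behind this kind of reconstruction can be illustrated on small integer matrices. The sketch below is not the paper's implementation: it assumes the inputs are already integers, and the moduli and function names (`MODULI`, `matmul_mod`, `crt_signed`, `matmul_crt`) are hypothetical choices for illustration. Each modular product only ever sees residues that fit in 8 bits, and the exact result is recovered afterwards, provided its magnitude stays below half the product of the moduli.

```python
from math import prod

# Hypothetical pairwise-coprime moduli; every residue fits in 8 bits.
MODULI = (251, 253, 255, 256)


def matmul_mod(a, b, m):
    """Multiply integer matrices using only their residues modulo m."""
    a_r = [[x % m for x in row] for row in a]
    b_r = [[x % m for x in row] for row in b]
    cols = list(zip(*b_r))
    return [[sum(x * y for x, y in zip(row, col)) % m for col in cols]
            for row in a_r]


def crt_signed(residues, moduli):
    """Return the unique integer in [-M/2, M/2) matching all residues."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse (Python 3.8+).
        total += r * Mi * pow(Mi, -1, m)
    total %= M
    return total - M if total >= M // 2 else total


def matmul_crt(a, b, moduli=MODULI):
    """Exact integer matmul rebuilt from independent low-precision products."""
    per_mod = [matmul_mod(a, b, m) for m in moduli]
    rows, cols = len(a), len(b[0])
    return [[crt_signed([c[i][j] for c in per_mod], moduli)
             for j in range(cols)] for i in range(rows)]


def matmul_exact(a, b):
    """Reference product using Python's arbitrary-precision integers."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


if __name__ == "__main__":
    A = [[1234, -567], [89, 1011]]
    B = [[-321, 654], [987, -1000]]
    print(matmul_crt(A, B) == matmul_exact(A, B))
```

The point of the design is that the per-modulus products are independent and need only narrow operands, which is the property that lets low-precision matrix hardware do the heavy work while the reconstruction step restores the full-width result.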

The paper identifies a critical performance cliff in NVIDIA's latest architecture: the B300's native FP64 throughput has regressed to just 1.3 TFLOPS, a 31x slowdown relative to the prior B200 generation. At that rate, even kernels that are traditionally memory-bound (sparse matrix-vector multiplication, matrix-vector multiplication, stencil operations) become compute-bound when run in native FP64. The researchers introduce the Tensor-Memory Equilibrium (TME) model, which augments the classical Roofline performance model with three new parameters: a compute multiplier, a bandwidth multiplier, and a reconstruction latency. Using register-level fusion, they show that the FP8 emulation overhead can be effectively hidden behind memory access latency, which is what makes the approach practical.
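A rough Roofline calculation shows why a 1.3 TFLOPS ceiling can turn a memory-bound kernel compute-bound. In the sketch below, only the 1.3 TFLOPS and 31x figures come from the article; the 8 TB/s bandwidth and the 0.25 FLOP/byte intensity are placeholder assumptions. The article does not give the TME equations, so `tme_time` is a guess at how a compute multiplier, a bandwidth multiplier, and a reconstruction latency might enter a Roofline-style estimate, not the authors' formula.

```python
def roofline(peak_flops, bandwidth, intensity):
    """Classical Roofline: attainable FLOP/s at a given FLOP/byte intensity."""
    return min(peak_flops, bandwidth * intensity)


def ridge_point(peak_flops, bandwidth):
    """Intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth


def tme_time(flops, bytes_moved, peak_flops, bandwidth,
             compute_mult, bandwidth_mult, recon_latency):
    """Hypothetical TME-style runtime estimate (assumed functional form).

    compute_mult   : extra low-precision work per emulated FP64 operation
    bandwidth_mult : extra memory traffic caused by the emulation
    recon_latency  : reconstruction time that is not hidden, in seconds
    """
    compute_time = flops * compute_mult / peak_flops
    memory_time = bytes_moved * bandwidth_mult / bandwidth
    return max(compute_time, memory_time) + recon_latency


if __name__ == "__main__":
    B300_FP64 = 1.3e12            # FLOP/s, from the article
    B200_FP64 = 31 * B300_FP64    # FLOP/s, implied by the 31x figure
    BANDWIDTH = 8e12              # bytes/s, placeholder assumption
    INTENSITY = 0.25              # FLOP/byte, placeholder for a light kernel

    for name, peak in (("B300", B300_FP64), ("B200", B200_FP64)):
        attainable = roofline(peak, BANDWIDTH, INTENSITY)
        regime = "compute-bound" if attainable == peak else "memory-bound"
        print(name, ridge_point(peak, BANDWIDTH), attainable, regime)
```

Under these placeholder numbers the same kernel sits below the ridge point on the faster FP64 part and above it on the slower one, which is the shift the paragraph describes.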

Projections indicate that Ozaki II can deliver approximately 500 TFLOPS of effective FP64-equivalent performance on the B300. In compute-bound scenarios that exceeds even the B200's native FP64 ceiling by over an order of magnitude, while bandwidth-bound workloads run at the memory bandwidth limit. Against an H100 baseline, the Ozaki II approach matches or exceeds performance across all tested kernels, whereas B300 native FP64 imposes regressions of up to 50x. The findings challenge the assumption that scientific computing depends on specialized high-precision hardware, and position NVIDIA's abundant FP8 tensor cores as the performance frontier for HPC.
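The order-of-magnitude comparison can be checked from the figures quoted in this article alone. The B200 ceiling here is inferred from the 31x ratio rather than taken from a datasheet:

```latex
\[
P_{\text{B200}} \approx 31 \times 1.3\ \text{TFLOPS} \approx 40.3\ \text{TFLOPS},
\qquad
\frac{500\ \text{TFLOPS}}{40.3\ \text{TFLOPS}} \approx 12.4
\]
```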

  • FP8 with Ozaki II matches or exceeds H100 across all tested workloads, while native B300 FP64 suffers up-to-50x regression, signaling a paradigm shift in HPC architecture design

Editorial Opinion

This research highlights a consequential architectural bet: NVIDIA's decision to drastically reduce native FP64 throughput on the B300 assumes end-users will either accept the performance cliff or adopt hybrid precision techniques, and the authors make a strong case that the latter path is viable. The paper's strength lies not in blaming NVIDIA, but in showing that GPU tensor architecture has matured enough to pair abundant FP8 throughput with cheap post-processing reconstruction, rather than spending expensive silicon on rarely used FP64 instructions. If these projections hold under real-world workload testing, this signals a major shift in HPC hardware requirements and challenges GPU vendors to justify further investment in native FP64 silicon.

