BotBeat

NVIDIA · UPDATE · 2026-03-06

NVIDIA CCCL 3.1 Introduces Configurable Floating-Point Determinism for GPU Reductions

Key Takeaways

  • NVIDIA CCCL 3.1 introduces three configurable determinism levels for GPU reductions: not_guaranteed, run_to_run (default), and gpu_to_gpu
  • The 'not_guaranteed' mode maximizes performance using atomic operations but may produce slightly different results between runs
  • The 'gpu_to_gpu' mode uses NVIDIA's Reproducible Floating-point Accumulator (RFA) to ensure bitwise-identical results across different GPU architectures, at a 20-30% performance cost on large datasets
Source (via Hacker News): https://developer.nvidia.com/blog/controlling-floating-point-determinism-in-nvidia-cccl/

Summary

NVIDIA has released CUDA Core Compute Libraries (CCCL) 3.1, introducing a new single-phase API in its CUB library that gives developers explicit control over floating-point determinism in reduction operations. The update addresses a fundamental challenge in parallel computing: floating-point addition is not associative, so the order in which values are combined can change the result, because every intermediate sum is rounded to finite precision.
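The non-associativity can be seen with three numbers. The snippet below is a minimal illustration in Python, whose floats are IEEE 754 doubles. The values are chosen for this example and do not come from the NVIDIA post.

```python
# Floating-point addition is not associative: regrouping changes where rounding happens.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # a + b cancels exactly to 0.0, then c is added
right = a + (b + c)  # c is lost when b + c is rounded, then a cancels b

print(left, right)   # 1.0 0.0
```

A parallel reduction decides the grouping for you, which is why two mathematically equivalent sums can differ in their last bits.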

Developers can now choose between three determinism levels: 'not_guaranteed' for maximum performance using atomic operations and single kernel launches, 'run_to_run' for consistent results across multiple runs on the same GPU using fixed hierarchical reduction trees, and 'gpu_to_gpu' for bitwise-identical results across different GPU architectures. The 'gpu_to_gpu' mode employs NVIDIA's Reproducible Floating-point Accumulator (RFA) technology, which groups inputs into exponent bins to ensure strict reproducibility.
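To give a feel for why grouping inputs by exponent helps, here is a deliberately simplified sketch. It is not NVIDIA's RFA algorithm: it assumes finite inputs, and it leans on Python's arbitrary-precision integers to make the accumulation inside each bin exact, which a GPU implementation cannot do. The function name `binned_sum` and the sample data are made up for this illustration.

```python
import math
from itertools import permutations

MANTISSA_BITS = 53  # significand width of an IEEE 754 double

def binned_sum(values):
    """Order-independent sum of finite floats (illustrative only, not RFA)."""
    bins = {}  # exponent -> exact integer sum of scaled mantissas
    for v in values:
        mantissa, exponent = math.frexp(v)         # v == mantissa * 2**exponent
        scaled = int(mantissa * 2**MANTISSA_BITS)  # exact: the product is an integer
        bins[exponent] = bins.get(exponent, 0) + scaled  # integer addition never rounds
    # Combine the bins in one fixed order, so any rounding here is always the same.
    total = 0.0
    for exponent in sorted(bins):
        total += math.ldexp(bins[exponent], exponent - MANTISSA_BITS)
    return total

data = [1e16, 1.0, -1e16, 1.0]
results = {binned_sum(order) for order in permutations(data)}
print(results)  # {2.0}
```

Because the per-bin totals do not depend on arrival order, and the bins are always combined in the same sequence, every permutation of the input yields the same bits.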

The tradeoffs are clear: 'not_guaranteed' offers the fastest performance, particularly for smaller datasets, by allowing unordered atomic operations that may produce slightly different results between runs. The 'run_to_run' mode serves as the default, balancing performance with reproducibility on the same hardware. Meanwhile, 'gpu_to_gpu' mode sacrifices 20-30% performance on large datasets to guarantee bitwise-identical results across different GPU models, providing tighter error bounds critical for scientific computing and regulatory compliance.
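The difference between the first two modes can be imitated on a CPU. The following is a rough analogy in Python, not CUB code: trying every input order stands in for atomics that land in arbitrary order, and a chunked sum stands in for a fixed reduction tree whose shape depends on the launch configuration. The helper names, chunk sizes, and data are assumptions made for this sketch.

```python
from itertools import permutations

def sequential_sum(values):
    """Plain left-to-right sum (an explicit loop, so no compensated summation is applied)."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, chunk):
    """Two-level reduction: sum each chunk, then sum the partial results."""
    partials = [sequential_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return sequential_sum(partials)

data = [1e16, 1.0, -1e16, 1.0]

# 'not_guaranteed' analogue: the accumulation order is arbitrary, so results vary.
unordered_results = {sequential_sum(order) for order in permutations(data)}

# 'run_to_run' analogue: the same tree shape gives the same bits every time.
same_shape = {chunked_sum(data, 2) for _ in range(5)}

# A different tree shape (think: another GPU with another block layout) may disagree.
shape_a = chunked_sum(data, 2)
shape_b = chunked_sum(data, 4)

print(sorted(unordered_results), same_shape, shape_a, shape_b)
```

In this toy case the two tree shapes disagree (0.0 versus 1.0), which is the kind of gap the 'gpu_to_gpu' mode is meant to close.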

This feature is only available through the new single-phase API that accepts an execution environment parameter, giving developers fine-grained control over the performance-reproducibility tradeoff based on their specific application requirements.

  • The new feature addresses floating-point non-associativity, a fundamental challenge where (a + b) + c may not equal a + (b + c) due to rounding errors
  • Configuration is only available through the new single-phase API that accepts an execution environment parameter
Machine Learning · MLOps & Infrastructure · AI Hardware · Science & Research · Open Source
