NVIDIA CCCL 3.1 Introduces Configurable Floating-Point Determinism for GPU Reductions
Key Takeaways
- NVIDIA CCCL 3.1 introduces three configurable determinism levels for GPU reductions: not_guaranteed, run_to_run (default), and gpu_to_gpu
- The 'not_guaranteed' mode maximizes performance using atomic operations but may produce slightly different results between runs
- GPU-to-GPU determinism uses NVIDIA's Reproducible Floating-point Accumulator (RFA) to ensure bitwise-identical results across different GPU architectures, with 20-30% performance overhead on large datasets
Summary
NVIDIA has released CUDA Core Compute Libraries (CCCL) 3.1, introducing a new single-phase API in its CUB library that gives developers explicit control over floating-point determinism in reduction operations. The update addresses a fundamental challenge in parallel computing: floating-point addition is not associative, so the order in which a reduction combines its inputs can change the result, because every intermediate sum is rounded to finite precision.
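The non-associativity described above is easy to reproduce on a CPU. The snippet below is a minimal sketch in plain Python (IEEE 754 double precision), not GPU code; the variable names are chosen only for illustration.

```python
# Floating-point addition is not associative: regrouping changes where rounding happens.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # rounds after a + b, then again after adding c
right = a + (b + c)  # rounds after b + c, then again after adding a

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# The effect grows when magnitudes differ widely: the 1.0 is absorbed by 1e16.
big = 1e16
absorbed = (big + 1.0) - big   # 0.0, because big + 1.0 rounds back to big
preserved = (big - big) + 1.0  # 1.0

print(absorbed, preserved)     # 0.0 1.0
```

A parallel reduction that lets the hardware decide the grouping is effectively choosing between such variants at run time, which is why results can drift by a few bits from run to run.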
Developers can now choose between three determinism levels: 'not_guaranteed' for maximum performance using atomic operations and single kernel launches, 'run_to_run' for consistent results across multiple runs on the same GPU using fixed hierarchical reduction trees, and 'gpu_to_gpu' for bitwise-identical results across different GPU architectures. The 'gpu_to_gpu' mode employs NVIDIA's Reproducible Floating-point Accumulator (RFA) technology, which groups inputs into exponent bins to ensure strict reproducibility.
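The difference between an unordered reduction and a fixed reduction tree can be imitated on a CPU. The following is a rough sketch, not CUB's implementation: `tree_sum` and `unordered_sum` are hypothetical helpers, and shuffling the input is only a stand-in for the arbitrary arrival order of atomic updates.

```python
import random

def tree_sum(values):
    """Pairwise reduction with a fixed shape: grouping depends only on len(values)."""
    level = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        # Combine neighbours (0,1), (2,3), ... in the same pattern on every call.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd leftover element carries over unchanged
        level = nxt
    return level[0]

def unordered_sum(values, rng):
    """Stand-in for atomics: accumulate in whatever order elements happen to arrive."""
    order = list(values)
    rng.shuffle(order)
    total = 0.0
    for v in order:
        total += v
    return total

# Test data with widely varying magnitudes, which makes rounding differences visible.
gen = random.Random(0)
data = [gen.uniform(-1.0, 1.0) * 10.0 ** gen.randint(-8, 8) for _ in range(10_000)]

fixed_results = {tree_sum(data) for _ in range(5)}
arrival_results = {unordered_sum(data, random.Random(seed)) for seed in range(5)}

print(len(fixed_results))    # 1: same tree, same bits on every call
print(len(arrival_results))  # usually greater than 1 for data like this
```

The fixed tree is reproducible only as long as its shape stays the same. If a different GPU would pick a different tree shape, the grouping changes, which is the gap the 'gpu_to_gpu' level is meant to close.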
The tradeoffs are clear: 'not_guaranteed' offers the fastest performance, particularly for smaller datasets, by allowing unordered atomic operations that may produce slightly different results between runs. The 'run_to_run' mode serves as the default, balancing performance with reproducibility on the same hardware. Meanwhile, 'gpu_to_gpu' mode gives up 20-30% of performance on large datasets to guarantee bitwise-identical results across different GPU models, and it provides tighter error bounds, which matter for scientific computing and regulatory compliance.
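Why exponent binning buys order independence, and why it costs extra work per element, can be seen in a toy model. The sketch below is not NVIDIA's RFA algorithm: `binned_sum` and `BIN_WIDTH` are hypothetical, it assumes finite inputs, and it leans on Python's arbitrary-precision integers where a real accumulator would use a small fixed number of floating-point bins. The point it illustrates is that once each input is converted to an integer count of a per-bin quantum, accumulation within a bin is integer addition, which is associative.

```python
import math
import random
from fractions import Fraction

BIN_WIDTH = 32  # exponent range covered by one bin (an arbitrary choice for this toy)

def binned_sum(values):
    """Order-independent sum of finite floats using per-exponent-bin integer accumulators."""
    bins = {}  # bin index -> exact integer number of quanta
    for v in values:
        if v == 0.0:
            continue
        m, e = math.frexp(v)   # v == m * 2**e with 0.5 <= |m| < 1
        idx = e // BIN_WIDTH   # which exponent bin this value falls into
        # The bin's quantum is 2**(idx * BIN_WIDTH - 53), so v is an exact multiple of it.
        quanta = int(math.ldexp(m, e - idx * BIN_WIDTH + 53))
        bins[idx] = bins.get(idx, 0) + quanta  # integer add: order cannot matter
    total = Fraction(0)
    for idx in sorted(bins):   # combine bins in one fixed order
        total += Fraction(bins[idx]) * Fraction(2) ** (idx * BIN_WIDTH - 53)
    return float(total)        # a single rounding at the very end

gen = random.Random(1)
data = [gen.uniform(-1.0, 1.0) * 10.0 ** gen.randint(-8, 8) for _ in range(2_000)]

results = set()
for seed in range(5):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    results.add(binned_sum(shuffled))

print(len(results))                     # 1: every ordering gives the same bits
print(binned_sum([1e16, 1.0, -1e16]))   # 1.0
```

Each element now needs an exponent extraction, a rescale, and a bin update instead of a single add. That extra per-element work is the kind of cost the reported slowdown for 'gpu_to_gpu' reflects, although the numbers in the article come from the real implementation, not from this toy.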
This feature is only available through the new single-phase API that accepts an execution environment parameter, giving developers fine-grained control over the performance-reproducibility tradeoff based on their specific application requirements.
- The new feature addresses floating-point non-associativity, a fundamental challenge where (a + b) + c may not equal a + (b + c) due to rounding errors
- Configuration is only available through the new single-phase API that accepts an execution environment parameter


