NVIDIA CCCL 3.1 Introduces Configurable Floating-Point Determinism for GPU Reductions

Key Takeaways

▸NVIDIA CCCL 3.1 introduces three configurable determinism levels for GPU reductions: not_guaranteed, run_to_run (default), and gpu_to_gpu
▸The 'not_guaranteed' mode maximizes performance using atomic operations but may produce slightly different results between runs
▸GPU-to-GPU determinism uses NVIDIA's Reproducible Floating-point Accumulator (RFA) to ensure bitwise-identical results across different GPU architectures, at a cost of 20-30% performance on large datasets

Source:

Hacker News: https://developer.nvidia.com/blog/controlling-floating-point-determinism-in-nvidia-cccl/

Summary

NVIDIA has released CUDA Core Compute Libraries (CCCL) 3.1, introducing a new single-phase API in its CUB library that gives developers explicit control over floating-point determinism in reduction operations. The update addresses a fundamental challenge in parallel computing: floating-point arithmetic isn't associative, meaning the order of operations can affect results due to rounding errors with finite precision.
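The non-associativity described above is easy to reproduce on a CPU. The following is a minimal host-side sketch, not CCCL code: the function names sum_left and sum_right are made up for illustration, and it assumes IEEE 754 single precision with no fast-math style compiler flags.

```cpp
// Two groupings of the same three-term sum. In exact arithmetic they agree;
// in single precision they can differ because each addition rounds.

// Computes (a + b) + c.
float sum_left(float a, float b, float c) {
    const float inner = a + b;   // rounded once here
    return inner + c;            // and once more here
}

// Computes a + (b + c).
float sum_right(float a, float b, float c) {
    const float inner = b + c;   // a small c can be rounded away here
    return a + inner;
}
```

Calling both with a = 1.0e8f, b = -1.0e8f, c = 1.0f gives 1 for sum_left and 0 for sum_right: near 1.0e8 adjacent single-precision values are 8 apart, so adding 1 to -1.0e8f rounds straight back to -1.0e8f.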

Developers can now choose between three determinism levels: 'not_guaranteed' for maximum performance using atomic operations and single kernel launches, 'run_to_run' for consistent results across multiple runs on the same GPU using fixed hierarchical reduction trees, and 'gpu_to_gpu' for bitwise-identical results across different GPU architectures. The 'gpu_to_gpu' mode employs NVIDIA's Reproducible Floating-point Accumulator (RFA) technology, which groups inputs into exponent bins to ensure strict reproducibility.
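The difference between unordered accumulation and a fixed reduction tree can be sketched on the host. This is an illustration under stated assumptions, not how CUB implements either mode: arrival_order_sum and fixed_tree_sum are hypothetical names, and the explicit order argument stands in for whatever order atomic updates happen to land in on a GPU.

```cpp
#include <cstddef>
#include <vector>

// Adds the elements in the order listed in `order`, mimicking atomic adds
// whose arrival order is decided by scheduling rather than by the program.
float arrival_order_sum(const std::vector<float>& data,
                        const std::vector<std::size_t>& order) {
    float acc = 0.0f;
    for (std::size_t i : order) {
        acc += data[i];
    }
    return acc;
}

// Pairwise reduction whose tree shape depends only on the input length,
// so the same input always goes through the same sequence of roundings.
float fixed_tree_sum(std::vector<float> data) {
    if (data.empty()) {
        return 0.0f;
    }
    for (std::size_t width = data.size(); width > 1; width = (width + 1) / 2) {
        const std::size_t half = width / 2;
        for (std::size_t i = 0; i < half; ++i) {
            data[i] = data[2 * i] + data[2 * i + 1];  // combine neighbours
        }
        if (width % 2 == 1) {
            data[half] = data[width - 1];  // carry the unpaired element up
        }
    }
    return data[0];
}
```

For the input {1.0e8f, 1.0f, -1.0e8f, 1.0f}, arrival_order_sum returns 1 for the order 0, 1, 2, 3 and 2 for the order 0, 2, 1, 3, while fixed_tree_sum returns 0 on every call. The tree result is reproducible but not more accurate here (the exact sum is 2): fixing the order pins down the rounding pattern without removing the rounding.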

The tradeoffs are clear: 'not_guaranteed' offers the fastest performance, particularly for smaller datasets, by allowing unordered atomic operations that may produce slightly different results between runs. The 'run_to_run' mode serves as the default, balancing performance with reproducibility on the same hardware. Meanwhile, 'gpu_to_gpu' mode sacrifices 20-30% performance on large datasets to guarantee bitwise-identical results across different GPU models, providing tighter error bounds critical for scientific computing and regulatory compliance.
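To see why an accumulator can be made independent of summation order at all, it helps to look at a much simpler technique than RFA. The sketch below is explicitly not RFA and not part of CCCL: it converts each input to a fixed-point integer and sums the integers, which is order-independent because integer addition is associative. The names to_fixed and fixed_point_sum and the frac_bits parameter are hypothetical, and the sketch assumes the scaled values and their sum fit in 64 bits.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: scale x by 2^frac_bits and round to the nearest
// integer. Assumes the scaled value fits in a 64-bit signed integer.
std::int64_t to_fixed(float x, int frac_bits) {
    const double scaled = std::ldexp(static_cast<double>(x), frac_bits);
    return static_cast<std::int64_t>(std::llround(scaled));
}

// Sums the inputs in the order given by `order`, but in integer arithmetic.
// Integer addition is exact (absent overflow), so every order gives the
// same accumulator value and therefore the same final result.
double fixed_point_sum(const std::vector<float>& data,
                       const std::vector<std::size_t>& order,
                       int frac_bits) {
    std::int64_t acc = 0;
    for (std::size_t i : order) {
        acc += to_fixed(data[i], frac_bits);
    }
    return std::ldexp(static_cast<double>(acc), -frac_bits);
}
```

With the input {1.0e8f, 1.0f, -1.0e8f, 1.0f} and frac_bits = 20, both the order 0, 1, 2, 3 and the order 0, 2, 1, 3 return 2. A single fixed scale like this only works for a narrow range of magnitudes; per the source, RFA instead groups inputs into exponent bins, which is what lets it cover the full floating-point range.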

This feature is only available through the new single-phase API that accepts an execution environment parameter, giving developers fine-grained control over the performance-reproducibility tradeoff based on their specific application requirements.

▸The new feature addresses floating-point non-associativity, a fundamental challenge where (a + b) + c may not equal a + (b + c) due to rounding errors
▸Configuration is only available through the new single-phase API that accepts an execution environment parameter

NVIDIA

UPDATE NVIDIA 2026-03-06
