BotBeat

NVIDIA · UPDATE · 2026-03-06

NVIDIA CCCL 3.1 Introduces Configurable Floating-Point Determinism for GPU Reductions

Key Takeaways

  • NVIDIA CCCL 3.1 introduces three configurable determinism levels for GPU reductions: not_guaranteed, run_to_run (the default), and gpu_to_gpu
  • The not_guaranteed mode maximizes performance using atomic operations but may produce slightly different results between runs
  • GPU-to-GPU determinism uses NVIDIA's Reproducible Floating-point Accumulator (RFA) to ensure bitwise-identical results across different GPU architectures, with 20-30% performance overhead
Source: Hacker News · https://developer.nvidia.com/blog/controlling-floating-point-determinism-in-nvidia-cccl/

Summary

NVIDIA has released CUDA Core Compute Libraries (CCCL) 3.1, introducing a new single-phase API in its CUB library that gives developers explicit control over floating-point determinism in reduction operations. The update addresses a fundamental challenge in parallel computing: floating-point addition is not associative, so the order in which values are combined can change the result through accumulated rounding error at finite precision.

Developers can now choose between three determinism levels: 'not_guaranteed' for maximum performance using atomic operations and single kernel launches, 'run_to_run' for consistent results across multiple runs on the same GPU using fixed hierarchical reduction trees, and 'gpu_to_gpu' for bitwise-identical results across different GPU architectures. The 'gpu_to_gpu' mode employs NVIDIA's Reproducible Floating-point Accumulator (RFA) technology, which groups inputs into exponent bins to ensure strict reproducibility.

The tradeoffs are clear: 'not_guaranteed' offers the fastest performance, particularly for smaller datasets, by allowing unordered atomic operations that may produce slightly different results between runs. The 'run_to_run' mode serves as the default, balancing performance with reproducibility on the same hardware. Meanwhile, 'gpu_to_gpu' mode sacrifices 20-30% performance on large datasets to guarantee bitwise-identical results across different GPU models, providing tighter error bounds critical for scientific computing and regulatory compliance.

This feature is only available through the new single-phase API that accepts an execution environment parameter, giving developers fine-grained control over the performance-reproducibility tradeoff based on their specific application requirements.

  • The new feature addresses floating-point non-associativity, a fundamental challenge where (a + b) + c may not equal a + (b + c) due to rounding errors
  • Configuration is only available through the new single-phase API that accepts an execution environment parameter
Machine Learning · MLOps & Infrastructure · AI Hardware · Science & Research · Open Source


© 2026 BotBeat