cuTile Rust: Safe GPU Kernel Programming Brings Memory Safety to NVIDIA Acceleration
Key Takeaways
- cuTile Rust removes the need for unsafe code in typical GPU kernel development by extending Rust's ownership and memory safety model to tile-based GPU programming
- In benchmarks, cuTile Rust reaches 96% of cuBLAS performance on GEMM and stays close to hand-optimized CUDA code, suggesting that safety need not cost performance
- Grout, a cuTile Rust-based LLM inference engine, achieves throughput competitive with vLLM and SGLang on real-world Qwen3 inference tasks
Summary
A new research system called cuTile Rust extends Rust's ownership guarantees and memory safety features to GPU kernel development, a domain where developers have traditionally had to give up Rust's protections in exchange for raw performance. The system enables safe, idiomatic GPU programming while staying competitive with hand-optimized CUDA code.
The research reports strong benchmark results on high-end NVIDIA hardware. On the B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFLOP/s for GEMM, which is 96% of cuBLAS performance. A proof-of-concept inference engine called Grout, built with cuTile Rust, delivers throughput competitive with established frameworks like vLLM and SGLang, reaching 171 generated tokens/s for Qwen3-4B on an RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200.
The system introduces mutable output tile splitting, host-side ownership preservation, and optional low-level escape hatches, alongside a composable execution model that spans synchronous launches, asynchronous pipelines, and CUDA graph replay. Developers can therefore trade off control and safety locally without giving up composability. This narrows the long-standing gap between the flexibility of GPU programming and the memory safety guarantees that Rust provides on the CPU.
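To make the idea of splitting a mutable output into tiles more concrete, here is a minimal CPU-only analogy in standard Rust. It is not the cuTile Rust API: the function name `tiled_add` and the use of OS threads in place of GPU thread blocks are assumptions made purely for illustration. The sketch shows how handing each worker a disjoint mutable slice of the output lets the borrow checker rule out overlapping writes, and how the caller regains full ownership of the buffer once the work finishes.

```rust
use std::thread;

/// Hypothetical helper (not part of cuTile Rust): element-wise `out = a + b`,
/// computed one tile per scoped thread as a stand-in for GPU thread blocks.
fn tiled_add(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be non-zero");
    assert_eq!(a.len(), out.len());
    assert_eq!(b.len(), out.len());

    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut` sub-slices, so each
        // worker has exclusive write access to its own output tile.
        for ((out_tile, a_tile), b_tile) in out
            .chunks_mut(tile)
            .zip(a.chunks(tile))
            .zip(b.chunks(tile))
        {
            s.spawn(move || {
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = x + y;
                }
            });
        }
    }); // All workers are joined here; the borrows of `out` end.
}

fn main() {
    let a = vec![1.0_f32; 10];
    let b = vec![2.0_f32; 10];
    let mut out = vec![0.0_f32; 10];

    // The last tile is shorter (10 is not a multiple of 4); that is fine.
    tiled_add(&a, &b, &mut out, 4);

    // The caller still owns `out` and can read it after the "launch".
    assert!(out.iter().all(|&v| v == 3.0));
    println!("{:?}", out);
}
```

Any attempt to give two workers the same output tile would be rejected at compile time, which is the kind of guarantee the article describes being carried over to GPU kernels.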
Editorial Opinion
This research addresses a critical pain point in GPU computing: the false choice between safety and performance. If cuTile Rust matures into a widely adopted tool, it could accelerate GPU software development by reducing memory-safety bugs and opening GPU programming to developers without deep CUDA expertise. The near-parity with hand-optimized code is particularly impressive and suggests Rust's abstractions need not be a bottleneck. Success will depend on ecosystem adoption and on whether the ergonomic benefits outweigh the learning curve for existing CUDA developers.



