Survey of 16 Open-Source RL Libraries Reveals Async Training as Post-Training Paradigm
Key Takeaways
- Asynchronous RL training—disaggregating inference and training onto separate GPU pools—has become essential for scaling post-training as rollout lengths grow exponentially
- Ray and NCCL broadcast are the dominant orchestration and weight synchronization standards, with distributed MoE support emerging as the next key differentiator
- Modern RL training faces new challenges including critic-free algorithms, process reward models, and multi-agent co-evolution that complicate async architecture design
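The disaggregated pattern in the first takeaway can be pictured as a producer/consumer loop: an inference pool fills a buffer with version-stamped rollouts while a training pool drains it, discarding samples that are too stale. The sketch below is a toy single-process simulation under assumed parameters (the tick structure, the per-tick production/consumption rates, and the `max_staleness` cap of 1 are all illustrative, not taken from any surveyed library):

```python
from collections import deque

def run_async_loop(num_ticks, produce_per_tick=3, consume_per_tick=2, max_staleness=1):
    """Toy simulation of disaggregated async RL.

    Each tick, the 'inference pool' appends version-stamped rollouts to a
    shared buffer, then the 'training pool' consumes a fixed batch, drops
    rollouts generated more than `max_staleness` policy versions ago, and
    publishes a new policy version (one optimizer step).
    """
    buffer = deque()
    policy_version = 0
    accepted, dropped = [], []
    rollout_id = 0
    for _ in range(num_ticks):
        # Inference pool: rollouts are stamped with the weights they used.
        for _ in range(produce_per_tick):
            buffer.append((policy_version, rollout_id))
            rollout_id += 1
        # Training pool: consume a batch, enforce the staleness cap.
        if len(buffer) >= consume_per_tick:
            for _ in range(consume_per_tick):
                version, rid = buffer.popleft()
                if policy_version - version <= max_staleness:
                    accepted.append(rid)
                else:
                    dropped.append(rid)
            policy_version += 1  # optimizer step publishes new weights
    return policy_version, accepted, dropped
```

Because production outpaces consumption here, the buffer backlog grows each tick and the oldest rollouts eventually exceed the staleness cap, which is exactly the trade-off staleness management has to police.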
Summary
A comprehensive survey analyzing 16 open-source reinforcement learning libraries reveals that asynchronous training—separating inference and training onto different GPU pools—has become the dominant architecture for large-scale post-training. The research, authored by Kashif Rasul, addresses a critical bottleneck in synchronous RL training where data generation from long rollouts (particularly from reasoning models and tool-use agents) causes training GPUs to sit idle up to 60% of the time. The study compares implementations across seven key axes: orchestration primitives, buffer design, weight synchronization protocols, staleness management, partial rollout handling, LoRA support, and distributed training backends.
Key findings show that Ray dominates orchestration across the surveyed libraries, while NCCL broadcast has emerged as the standard for asynchronous weight transfer; distributed Mixture of Experts (MoE) support is flagged as an emerging differentiator. The paper also outlines open challenges for async RL architectures: critic-free algorithms that reduce memory but increase weight-sync pressure, process reward models that introduce new synchronization barriers, and training-inference mismatches exemplified by models like DeepSeek v3.2. The findings inform Hugging Face's design of TRL's Async Trainer, which prioritizes lightweight orchestration, NCCL-based weight synchronization, and support for partial rollouts in agentic workloads.
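The broadcast pattern described above amounts to the trainer pushing version-stamped parameters to every inference replica between optimizer steps, with each replica hot-swapping them in. A minimal CPU-only analogue follows; the class names and the deep-copy "transport" are illustrative stand-ins, since a real implementation would broadcast GPU tensors with NCCL rather than copy Python dicts:

```python
import copy

class InferenceWorker:
    """Stand-in for an inference replica that receives a weight push and
    hot-swaps the new parameters between rollouts."""
    def __init__(self):
        self.weights = {"w": 0.0}
        self.version = 0

    def receive_broadcast(self, weights, version):
        # In a real system this would be an NCCL broadcast from the trainer
        # rank directly into GPU memory; here we just deep-copy on CPU.
        self.weights = copy.deepcopy(weights)
        self.version = version

class Trainer:
    """Stand-in for the training pool: steps the policy, then pushes
    version-stamped weights to every inference replica."""
    def __init__(self, workers):
        self.workers = workers
        self.weights = {"w": 0.0}
        self.version = 0

    def optimizer_step(self):
        self.weights["w"] += 1.0  # stand-in for a real gradient update
        self.version += 1

    def broadcast_weights(self):
        for worker in self.workers:
            worker.receive_broadcast(self.weights, self.version)
```

Stamping a version onto each broadcast is what lets the buffer-side staleness checks compare a rollout's generation weights against the trainer's current weights.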
LoRA training support, by contrast, remains sparse across the surveyed libraries, leaving a gap between current async implementations and common fine-tuning use cases.
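Partial-rollout handling, one of the seven axes above, means a long generation can continue across a mid-flight weight sync instead of being discarded; the trainer then needs to know which segments of the trajectory came from which policy. A toy sketch under assumed names (the chunked generator and the `current_version` callback are hypothetical, not TRL's API):

```python
from typing import Callable, List, Tuple

def generate_with_partial_rollouts(
    num_tokens: int,
    chunk_size: int,
    current_version: Callable[[], int],
) -> List[Tuple[int, List[int]]]:
    """Generate token ids in fixed-size chunks, tagging each chunk with the
    policy version in effect when it was produced. A downstream trainer can
    then correct for the mixed-policy trajectory (e.g., with importance
    weights) instead of throwing the rollout away."""
    segments: List[Tuple[int, List[int]]] = []
    produced = 0
    while produced < num_tokens:
        n = min(chunk_size, num_tokens - produced)
        # Stand-in for sampling from the model; real code would decode here.
        tokens = list(range(produced, produced + n))
        segments.append((current_version(), tokens))
        produced += n
    return segments
```

A weight sync arriving mid-generation simply changes what `current_version` returns for later chunks, so the rollout resumes under the new weights with no work lost.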
Editorial Opinion
This survey provides valuable guidance for practitioners scaling RL training, but the fragmentation across 16 different library implementations highlights the immaturity of async RL orchestration as a field. The emergence of new challenges—critic-free algorithms, process rewards, and agentic co-evolution—suggests that current async patterns may not be future-proof; standardization around a reference architecture could accelerate adoption and reduce the engineering burden on teams building large-scale post-training systems.