RadixArk Achieves Thousand-Scale LoRA Adapter Training with Extended Miles Framework
Key Takeaways
- Successfully trains 1,536 LoRA adapters concurrently on a single base model with sub-3-minute training steps
- Eliminates VRAM duplication by sharing a frozen base model and routing tokens to different lightweight task-specific adapters
- Enables efficient large-scale RL experimentation: thousands of policy variants can now be tested and compared within the same training loop
Summary
RadixArk has extended Miles, its open-source RL post-training framework, with a multi-adapter LoRA training system that trains thousands of LoRA adapters concurrently on a single shared base model. By modifying Megatron-Bridge and implementing multi-LoRA routing through SGLang, the team demonstrates the ability to train 1,536 LoRA adapter instances simultaneously on a Qwen3.6-35B model, with step times under 3 minutes. The approach was validated on the GSM8K benchmark.
The breakthrough addresses a critical infrastructure bottleneck in scaling RL experiments: traditionally, training multiple LoRA adapters requires replicating the entire base model for each concurrent run, wasting substantial VRAM. The new approach shares a single base model across all adapters while routing tokens to different task-specific LoRA deltas, enabling researchers to explore thousands of policy variations (prompt design, reward signals, curriculum ablations) within a single training step.
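The routing idea can be illustrated with a minimal sketch in plain Python. It is not the Miles, Megatron-Bridge, or SGLang implementation: the class `MultiLoRALinear`, its methods, and the adapter names are hypothetical, and the usual LoRA scaling factor is omitted for brevity. The sketch assumes a single linear layer whose base weight is stored once, with each token carrying the ID of the adapter whose low-rank delta should be added.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class MultiLoRALinear:
    """Hypothetical layer: one frozen base weight shared by many adapters."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight  # stored once, shared by all adapters
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def add_adapter(self, adapter_id: str, a: Matrix, b: Matrix) -> None:
        # A has shape (rank, d_in); B has shape (d_out, rank).
        self.adapters[adapter_id] = (a, b)

    def forward(self, tokens: List[Vector], adapter_ids: List[str]) -> List[Vector]:
        outputs = []
        for x, adapter_id in zip(tokens, adapter_ids):
            base_out = matvec(self.base_weight, x)   # shared base path
            a, b = self.adapters[adapter_id]         # per-token routing
            delta = matvec(b, matvec(a, x))          # low-rank update B(Ax)
            outputs.append([y + d for y, d in zip(base_out, delta)])
        return outputs


layer = MultiLoRALinear(base_weight=[[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("policy_a", a=[[1.0, 1.0]], b=[[0.5], [0.0]])
layer.add_adapter("policy_b", a=[[1.0, -1.0]], b=[[0.0], [2.0]])

# The same input routed to two different adapters gives two different outputs.
batch = [[1.0, 2.0], [1.0, 2.0]]
result = layer.forward(batch, ["policy_a", "policy_b"])
print(result)  # [[2.5, 2.0], [1.0, 0.0]]
```

Only the small `A` and `B` matrices differ per adapter, which is why adding another policy variant costs far less memory than another copy of the base model.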
Implementation details include online adapter loading and unloading without trainer restarts, multi-LoRA rollouts via SGLang's native interface, unified FP8 training support, and memory optimization through adapter-free expert design. This architectural approach transforms LoRA from a single-policy fine-tuning technique into a platform for large-scale parallel policy exploration.
- Built on Megatron-Bridge and SGLang with online adapter lifecycle management and memory optimization for expert layers
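Online adapter loading and unloading can be pictured as a registry that changes while the training process keeps running. The sketch below is an illustration under stated assumptions, not the Miles or SGLang interface: `AdapterRegistry`, its methods, the capacity limit, and the adapter names are all hypothetical, and adapter weights are stood in for by short lists of floats.

```python
from typing import Dict, List


class AdapterRegistry:
    """Hypothetical in-memory registry; not the Miles or SGLang API."""

    def __init__(self, max_adapters: int) -> None:
        self.max_adapters = max_adapters
        self._weights: Dict[str, List[float]] = {}

    def load(self, adapter_id: str, weights: List[float]) -> None:
        if adapter_id in self._weights:
            raise ValueError(f"adapter {adapter_id!r} is already loaded")
        if len(self._weights) >= self.max_adapters:
            raise RuntimeError("adapter capacity reached; unload one first")
        self._weights[adapter_id] = list(weights)

    def unload(self, adapter_id: str) -> List[float]:
        # Return the weights so the caller can checkpoint them before discarding.
        return self._weights.pop(adapter_id)

    def active(self) -> List[str]:
        return sorted(self._weights)


registry = AdapterRegistry(max_adapters=2)
registry.load("reward_v1", [0.0, 0.0])
registry.load("reward_v2", [0.0, 0.0])

# Retire one experiment and start another without recreating the registry,
# which stands in here for avoiding a trainer restart.
saved = registry.unload("reward_v1")
registry.load("curriculum_a", [0.1, 0.2])
print(registry.active())  # ['curriculum_a', 'reward_v2']
```

The base model never appears in this lifecycle: swapping an experiment in or out touches only the adapter entries.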
Editorial Opinion
This is a meaningful systems contribution that democratizes large-scale RL policy exploration. The elegance of sharing a frozen base model while routing through lightweight task-specific adapters should become standard practice for RL infrastructure. For teams exploring extensive hyperparameter and design spaces, as is common in frontier model training, this significantly reduces compute waste. However, adoption and real-world impact depend on community uptake of Miles and on validation with production RL workloads, beyond the GSM8K stress test.


