Muon Optimizer Breaks Neural Network Training Speed Records with Novel Orthogonalization Approach
Key Takeaways
- Muon uses Newton-Schulz iteration to orthogonalize gradient updates, achieving a 1.35x speedup on NanoGPT benchmarks and breaking the CIFAR-10 training speed record
- The optimizer specifically targets the 2D parameters of hidden layers, addressing the high condition number typical of standard optimizer updates
- Muon cut the time to train a 1.5B-parameter model to GPT-2 XL-level performance from 13.3 hours to 10 hours on 8xH100 GPUs
Summary
Researcher Keller Jordan has introduced Muon, a novel optimizer designed specifically for the hidden layers of neural networks that uses Newton-Schulz iteration to orthogonalize update matrices. The optimizer has demonstrated significant performance improvements across multiple benchmarks, including breaking the CIFAR-10 training speed record by reducing the time to reach 94% accuracy from 3.3 to 2.6 A100-seconds, and improving NanoGPT speedrunning performance by 1.35x. Muon works by taking the updates generated by SGD-momentum and applying a Newton-Schulz iteration as a post-processing step to approximately orthogonalize each update matrix before it is applied to the parameters.
The key innovation behind Muon is its approach to the high condition number typically found in the updates that standard optimizers such as SGD-momentum and Adam produce for the 2D parameters of transformer-based neural networks. By orthogonalizing these updates, effectively replacing each one with its nearest semi-orthogonal matrix, Muon increases the scale of "rare directions" that have small magnitude in the raw update but are important for learning. The optimizer has continued to show improvements while scaling to larger models, including training a 1.5B-parameter transformer to GPT-2 XL-level performance on HellaSwag in 10 hours on 8xH100 GPUs, compared to 13.3 hours with AdamW.
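The orthogonalization step described above can be sketched with a cubic Newton-Schulz iteration. This is a simplified illustration, not Muon's released code: the actual implementation runs a tuned quintic variant with different coefficients for faster convergence in low precision, and the function name here is invented.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=15):
    """Approximately replace G with its nearest semi-orthogonal matrix.

    The cubic iteration X <- 1.5*X - 0.5*X @ X.T @ X converges to the
    polar (orthogonal) factor of G when G's singular values lie in
    (0, sqrt(3)); dividing by the Frobenius norm guarantees that for
    any nonzero G, since it bounds the spectral norm by 1.
    """
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate in the wide orientation so X @ X.T stays small
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if transposed else X

# A raw SGD-momentum update matrix (stand-in: random Gaussian), tall 8x4.
update = np.random.default_rng(0).standard_normal((8, 4))
ortho = newton_schulz_orthogonalize(update)
# For a tall semi-orthogonal matrix, ortho.T @ ortho is approximately I.
```

Because the iteration drives every singular value toward 1 while preserving the singular vectors, it equalizes the scale of all update directions, which is exactly how the "rare directions" with small magnitude get amplified.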
Muon is designed specifically for the 2D parameters of neural network hidden layers; scalar and vector parameters, as well as input and output layers, should continue to use a standard optimizer such as AdamW. Muon can also handle 4D convolutional parameters by flattening their last three dimensions. A PyTorch implementation has been made publicly available, and the optimizer has already been integrated into record-breaking speedrun attempts for both the NanoGPT and CIFAR-10 training tasks.
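The flattening and parameter-split rules above can be sketched as follows. The weight layout assumes PyTorch's `(out_channels, in_channels, kh, kw)` convention, and `route_parameter` is a hypothetical helper paraphrasing the text, not part of the released API:

```python
import numpy as np

# Hypothetical 4D convolution weight in PyTorch's layout:
# (out_channels, in_channels, kernel_h, kernel_w).
conv_weight = np.zeros((64, 32, 3, 3))

# Muon flattens the last three dimensions into one, yielding a 2D
# matrix that the Newton-Schulz orthogonalization can operate on.
flat = conv_weight.reshape(conv_weight.shape[0], -1)  # shape (64, 288)

def route_parameter(param):
    """Illustrative routing rule from the article: 2D (or flattened 4D)
    hidden-layer weights go to Muon; scalars and vectors stay with a
    standard optimizer such as AdamW. In practice the input embedding
    and output head are also kept on AdamW even though they are 2D,
    which requires excluding them by name rather than by shape."""
    return "muon" if param.ndim >= 2 else "adamw"
```

In a real training loop this split would be expressed as two optimizer parameter groups, one per rule.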
- Open-source PyTorch implementation is available and has been successfully integrated into competitive speedrunning tasks
- The optimizer continues showing improvements when scaling to larger models (774M and 1.5B parameters)
Editorial Opinion
Muon represents an intriguing departure from the incremental optimizer improvements of recent years, targeting a specific architectural component rather than attempting to be a universal solution. The focus on orthogonalizing updates for hidden layers is theoretically elegant and empirically validated, though the requirement to pair it with a traditional optimizer for other parameter types adds complexity. The speedup results are impressive, particularly at scale. Still, the research community will be watching closely to see whether these gains hold across diverse architectures beyond transformers, and whether the added implementation complexity becomes a barrier to widespread adoption.



