Sophia: New Second-Order Optimizer Achieves 2x Speedup in Language Model Training
Key Takeaways
- Sophia achieves a 2x speedup over Adam on language models, measured in training steps, total compute, and wall-clock time
- The optimizer combines diagonal Hessian estimation with element-wise clipping, enabling scalability without prohibitive computational overhead
- Results demonstrate that second-order optimization can be practically viable for large-scale language model pre-training, potentially reducing training costs significantly
Summary
Researchers have introduced Sophia, a scalable second-order optimizer designed to significantly improve the efficiency of language model pre-training. The optimizer uses a lightweight estimate of the diagonal Hessian as a preconditioner, combined with element-wise clipping that bounds the worst-case update size and guards against inaccurate curvature estimates in the non-convex loss landscape. Unlike many second-order methods that incur substantial per-step overhead, Sophia refreshes the diagonal Hessian estimate only every few iterations, keeping the average cost per step close to that of a first-order method.
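The mechanism can be sketched in a few lines. The sketch below is a simplified illustration, not the authors' implementation: the hyperparameter names (`lr`, `beta1`, `gamma`) and the Hutchinson-style diagonal estimator are assumptions chosen to mirror the description above (precondition a gradient moving average by a diagonal Hessian estimate, then clip element-wise).

```python
import numpy as np

def sophia_step(theta, m, h, grad, lr=0.01, beta1=0.96, gamma=0.01, eps=1e-12):
    """One Sophia-style update (illustrative sketch, not the authors' code).

    m is an exponential moving average of gradients; h is a diagonal
    Hessian estimate, refreshed only every few steps elsewhere.
    """
    m = beta1 * m + (1 - beta1) * grad
    # Precondition by the curvature estimate, then clip element-wise to
    # [-1, 1] so the worst-case per-coordinate update is bounded by lr.
    update = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
    return theta - lr * update, m

def hutchinson_diag_hessian(hvp, dim, rng):
    """One-sample Hutchinson-style diagonal estimate: E[u * (H u)] = diag(H)
    for u ~ N(0, I), using only a Hessian-vector product hvp."""
    u = rng.standard_normal(dim)
    return u * hvp(u)
```

Because the clipped, preconditioned update is bounded, a stale or noisy Hessian estimate can slow progress but cannot blow up a step, which is why the estimate only needs refreshing every few iterations.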
In extensive experiments with GPT models ranging from 125M to 1.5B parameters, Sophia demonstrated a 2x speedup compared to Adam across multiple metrics: achieving the same perplexity in 50% fewer training steps, with reduced total compute requirements and wall-clock time. The clipping mechanism proves critical for controlling worst-case update sizes and mitigating the negative impacts of rapid Hessian changes during training. Theoretically, the researchers show that Sophia adapts to heterogeneous curvatures across parameter dimensions, yielding runtime bounds that are independent of the loss function's condition number.
Editorial Opinion
Sophia represents a meaningful advance in optimization for large-scale deep learning, addressing a critical pain point in language model development—the enormous computational cost of pre-training. The achievement of 2x speedup through elegant algorithmic design rather than simply scaling hardware suggests there is still substantial room for optimization innovation in the field. If these results generalize to even larger models and production settings, Sophia could have material economic and environmental implications for AI development.