New Research Reveals Complex Singularities Behind Neural Network Training Instability
Key Takeaways
- Training instabilities occur when optimization steps exceed the Taylor series convergence radius, which is limited by complex zeros of the softmax partition function, not just by local Hessian curvature
- A new radius-based step-size controller can be incorporated into standard optimizers to automatically adapt step sizes based on local geometric safety criteria
- The approach provides closed-form, interpretable estimates of safe step sizes using directional logit derivatives, offering a fundamentally different perspective from traditional smoothness-based analysis
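To make the central idea concrete, here is a minimal sketch of where those complex zeros come from in the simplest possible setting. For two classes, the partition function restricted to a search direction, Z(t) = exp(z0 + t·d0) + exp(z1 + t·d1), has an exact closed-form nearest zero in the complex t-plane, and its distance from the origin is the convergence radius of the Taylor expansion. This toy two-class case is an illustration of the mechanism only; the paper's general multi-class bounds are not reproduced here, and the function names are ours.

```python
import numpy as np

def partition(z, d, t):
    """Softmax partition function restricted to direction d, at (complex) step t."""
    return np.sum(np.exp(np.asarray(z) + t * np.asarray(d)))

def two_class_radius(z, d):
    """Exact Taylor convergence radius for K = 2 classes.

    Z(t) vanishes when (z0 - z1) + t (d0 - d1) = i*pi*(2m + 1);
    the nearest zero (m = 0 or m = -1) sits at distance
    sqrt(gap^2 + pi^2) / |d0 - d1| from the origin.
    """
    dz, dd = z[0] - z[1], d[0] - d[1]
    if abs(dd) < 1e-12:
        return np.inf  # direction leaves the logit gap unchanged: no zero
    return np.sqrt(dz**2 + np.pi**2) / abs(dd)

# Example: logits [2, -1], direction [1, -1].
z = np.array([2.0, -1.0])
d = np.array([1.0, -1.0])
t0 = (1j * np.pi - (z[0] - z[1])) / (d[0] - d[1])  # nearest complex zero
print(abs(partition(z, d, t0)))   # numerically zero
print(two_class_radius(z, d))     # equals |t0|
```

Real steps longer than this radius leave the region where the Taylor series of the loss converges, which is the geometric failure mode the paper associates with training collapse.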
Summary
A new paper titled "Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy" addresses a fundamental gap in deep learning optimization theory. The research identifies that training instabilities occur when optimization steps exceed the radius of convergence of the loss function's Taylor expansion, which is determined by complex zeros of the softmax partition function. Rather than relying on traditional approaches like Hessian-based smoothness analysis, the authors use complex analysis to estimate safe step sizes directly from the geometry of the loss landscape.
The work introduces a practical radius-based step-size controller that can be integrated into standard optimizers (SGD, momentum SGD, Adam) to prevent training collapse. The controller ensures proposed updates remain within the local convergence radius, rescaling steps when necessary. The authors provide comprehensive tutorials, notebooks, and reproducible experiment scripts demonstrating how directional logit derivatives can bound the convergence radius and why this approach differs fundamentally from existing smoothness criteria. Experimental results show that all tested architectures collapse once the normalized step size (step length divided by the local convergence radius) exceeds 1, validating the theoretical predictions.
Open-source tutorials and reproducible code enable practitioners to estimate convergence radii and implement the controller in their own training pipelines.
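The controller logic described above can be sketched in a few lines. The example below is our own illustration, not the paper's implementation: it runs plain gradient descent on a two-class cross-entropy over the logits directly, bounds the directional convergence radius with the exact two-class formula, and rescales any proposed step whose normalized length (step length over local radius) would exceed a safety fraction. The function names and the `safety` parameter are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def directional_radius(z, d):
    """Two-class convergence-radius bound along unit direction d (toy case)."""
    dz, dd = z[0] - z[1], d[0] - d[1]
    return np.inf if abs(dd) < 1e-12 else np.sqrt(dz**2 + np.pi**2) / abs(dd)

def radius_controlled_sgd(z, y, lr=5.0, steps=20, safety=0.5):
    """Gradient descent on 2-class cross-entropy with a radius-based controller.

    Each proposed step is rescaled so the normalized step size
    (step length / local convergence radius) never exceeds `safety`.
    Returns the final logits and the history of normalized step sizes.
    """
    onehot = np.eye(2)[y]
    history = []
    for _ in range(steps):
        g = softmax(z) - onehot          # gradient of cross-entropy in logit space
        gnorm = np.linalg.norm(g)
        if gnorm < 1e-12:
            break
        d = -g / gnorm                   # unit descent direction
        r = directional_radius(z, d)     # local convergence radius along d
        t = lr * gnorm                   # proposed step length
        if t > safety * r:               # step would leave the safe disc: rescale
            t = safety * r
        history.append(t / r)
        z = z + t * d
    return z, history
```

With a deliberately oversized learning rate (`lr=5.0`), the raw step exceeds the local radius early in training and the controller clips it; an unclipped step of that length would cross the nearest partition-function zero, which is exactly the regime where the paper reports collapse.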
Editorial Opinion
This work addresses a critical blind spot in deep learning optimization: the fact that local Taylor models often guide steps well outside their radius of validity. By connecting neural network training instability to complex analysis and partition function zeros, the authors provide both theoretical insight and practical tools. The accessibility of the open-source repository—complete with tutorials and optimizer integrations—could make this methodology widely adoptable, potentially reducing training failures across the field.