Researchers Develop Closed-Form Formula to Predict LLM Output Sensitivity
Key Takeaways
- ▸A closed-form formula predicts LLM output stability along any direction in the residual stream, using only quantities available at inference time (softmax outputs and the unembedding matrix)
- ▸The formula achieves <8% error on the highest-curvature direction across three transformer architectures (Qwen, Llama, and Pythia), suggesting it generalizes beyond a single model family
- ▸Connects three mathematical frameworks—KL divergence, loss curvature, and Fisher information—providing geometric intuition for how transformers learn and maintain stable predictions
Summary
Researchers have derived a closed-form mathematical formula that predicts how sensitive large language models are to perturbations in their residual stream, the internal vector representation that determines next-token predictions. The formula, grounded in a second-order Taylor expansion of KL divergence, uses only quantities already available at inference time (softmax outputs and the unembedding matrix). When tested on the highest-curvature direction at small perturbation thresholds, the formula predicts output stability boundaries to within 1% on Qwen 3-1.7B and to within 8% across three transformer architectures (Qwen, Llama-3.2-1B, and Pythia-1B).
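The idea can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes the logits are a linear function of the residual stream (z = W h, ignoring any final normalization layer), in which case the second-order term of the KL divergence along a direction u is half the variance, under the current softmax distribution, of the logit shifts W u. The function names, the toy matrix `W`, and the KL threshold `eps` are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_from(W, h):
    # W: vocab x d_model unembedding matrix (list of rows); h: residual-stream vector.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def curvature(W, p, u):
    # u^T H u with H = W^T (diag(p) - p p^T) W.
    # Equivalently: the variance under p of the per-token logit shifts W u.
    a = logits_from(W, u)
    mean = sum(pi * ai for pi, ai in zip(p, a))
    return sum(pi * (ai - mean) ** 2 for pi, ai in zip(p, a))

def predicted_radius(W, h, u, eps):
    # Step size along u at which the second-order model predicts KL = eps.
    p = softmax(logits_from(W, h))
    c = curvature(W, p, u)
    return math.inf if c == 0.0 else math.sqrt(2.0 * eps / c)

def kl(p, q):
    # KL(p || q) for two discrete distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy setup (assumed values): 3-token vocabulary, 2-dimensional residual stream.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
h = [0.3, -0.2]
u = [0.6, 0.8]   # unit-norm perturbation direction
eps = 1e-4       # small KL threshold, where the quadratic model should hold

r = predicted_radius(W, h, u, eps)
h_moved = [x + r * d for x, d in zip(h, u)]
actual = kl(softmax(logits_from(W, h)), softmax(logits_from(W, h_moved)))
print(f"predicted radius: {r:.5f}  target KL: {eps:.2e}  actual KL: {actual:.2e}")
```

The final lines check the prediction the direct way: step exactly the predicted distance along `u` and measure the true KL divergence. For a small threshold the two should nearly agree, and the gap is expected to grow as `eps` increases and higher-order terms start to matter.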
The work extends earlier observations about 'stable regions' in embedding space: plateaus where the output remains unchanged despite input perturbations. The formula characterizes these stability boundaries through the Hessian of the next-token loss, which measures how sharply predictions curve around the current residual stream. The researchers frame the Hessian in three mathematically equivalent ways: as the second-order term in a Taylor expansion of KL divergence, as local loss curvature, and as Fisher information geometry pulled back through the unembedding matrix. For larger perturbations, isotonic calibration can correct systematic bias in the raw predictions, achieving 50-73% predictive accuracy across different architectures. Because it relies only on inference-time quantities, the approach enables practical robustness analysis without model retraining or expensive perturbation sampling.
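The three views of the Hessian can be written out under a simplifying assumption that this summary does not itself state: that the logits are linear in the residual stream, z = W_U h, with any final normalization layer ignored. The symbols W_U (unembedding matrix), p (softmax output), δ (perturbation), u (unit direction), and ε (KL threshold) are notation chosen here, and the last line is one plausible reading of a "stability boundary", not necessarily the authors' exact formula.

```latex
\begin{align}
z(h) &= W_U h, \qquad p(h) = \operatorname{softmax}\bigl(z(h)\bigr) \\
H &= W_U^{\top}\left(\operatorname{diag}(p) - p\,p^{\top}\right) W_U \\
D_{\mathrm{KL}}\!\left(p(h)\,\|\,p(h+\delta)\right) &= \tfrac{1}{2}\,\delta^{\top} H\,\delta + O\!\left(\|\delta\|^{3}\right) \\
\nabla_h^{2}\left[-\log p_y(h)\right] &= H \quad \text{for any target token } y \\
r_{\varepsilon}(u) &\approx \sqrt{\frac{2\varepsilon}{u^{\top} H\,u}}, \qquad \|u\| = 1
\end{align}
```

Here diag(p) − p pᵀ is the Fisher information of a categorical distribution in logit coordinates, so the second line is that matrix pulled back through W_U. The third and fourth lines give the KL-expansion and loss-curvature views of the same H. Directions with large uᵀHu (high curvature) have a small predicted radius, which is why the highest-curvature direction is the natural stress test.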
Editorial Opinion
This research represents an elegant mathematical contribution to understanding transformer internals. By deriving a closed-form solution to what seemed like an intractable empirical problem, the authors provide both theoretical insight and practical utility. This work could enable new approaches to LLM evaluation, adversarial robustness testing, and mechanistic interpretability, making it a valuable tool for practitioners building safer and more reliable language models.



