New Research Reveals How Large Language Models Develop Value Alignment During Training
Key Takeaways
- Supervised fine-tuning (SFT) is the primary stage where LLMs establish their core value alignment; later preference optimization has minimal re-alignment effects
- Different preference optimization algorithms lead to divergent value alignment outcomes, even when trained on the same preference data
- The timing and magnitude of 'value drifts' during post-training can be measured and analyzed to inform better model alignment practices
Summary
A new research paper titled "Value Drifts: Tracing Value Alignment During LLM Post-Training" investigates how large language models learn to align with human values during post-training. The study, which analyzed models including Llama-3 and Qwen-3, tracked when and how value alignment emerges through supervised fine-tuning (SFT) and preference optimization. The researchers found that the SFT phase is critical for establishing a model's foundational values, while subsequent preference optimization has limited ability to alter them. They also found that different preference optimization algorithms produce different alignment outcomes even when trained on identical preference data, suggesting that algorithm selection plays a crucial role in shaping model behavior.
The findings provide actionable guidance for data curation and algorithm selection to improve LLM alignment with human values.
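
To make the idea of measuring the timing and magnitude of a value drift concrete, here is a minimal Python sketch. It is an illustration only and not the paper's method: the stance scores, the function names `drift_magnitude` and `drift_time`, and the `tolerance` threshold are all hypothetical. It assumes that a model's stance on one value has already been scored at several training checkpoints on a scale from -1 (opposes) to 1 (supports).

```python
from typing import Sequence


def drift_magnitude(stances: Sequence[float]) -> float:
    """Net change in stance between the first and last checkpoint."""
    return stances[-1] - stances[0]


def drift_time(stances: Sequence[float], tolerance: float = 0.05) -> int:
    """Index of the earliest checkpoint from which the stance stays
    within `tolerance` of its final value (hypothetical definition)."""
    final = stances[-1]
    for i in range(len(stances)):
        # Require every later checkpoint to stay close to the final stance.
        if all(abs(s - final) <= tolerance for s in stances[i:]):
            return i
    return len(stances) - 1


# Hypothetical stance scores at five checkpoints of one training stage.
sft_stances = [-0.2, 0.3, 0.58, 0.6, 0.6]

magnitude = drift_magnitude(sft_stances)
settled_at = drift_time(sft_stances)
print(f"drift magnitude: {magnitude:.2f}, settled at checkpoint {settled_at}")
```

Under these assumptions, comparing the two numbers for an SFT run against those for a later preference optimization run would show which stage moves a given value further and how quickly it settles.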
Editorial Opinion
This study addresses a critical gap in LLM alignment research by moving beyond static evaluations of fully trained models to examine the dynamic process of value learning. The finding that SFT establishes foundational values while preference optimization has limited re-alignment capacity suggests that practitioners should focus alignment efforts earlier in training rather than relying on final-stage preference optimization. The discovery that algorithm choice matters even when the preference data is held fixed is particularly valuable, as it provides a new lever for improving model alignment without requiring extensive dataset curation.



