{"ID":2896112,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.07723","arxiv_id":"2507.07723","title":"Stable Preference Optimization: A Bilevel Approach to Catastrophic Preference Shift","abstract":"Direct Preference Learning has emerged as a dominant offline paradigm for preference optimization. Most of these methods are based on the Bradley-Terry (BT) model for pairwise preference ranking, which directly aligns language model with human preference. Prior work has observed a counter-intuitive phenomenon termed likelihood displacement, where the absolute probability of preferred responses decreases simultaneously during training. We demonstrate that such displacement can lead to a more devastating failure mode, which we defined as \\textit{Catastrophic Preference Shift}, where the lost preference probability mass inadvertently shifts toward out-of-distribution (OOD) responses. Such a failure mode is a key limitation shared across BT-style direct preference learning methods, due to the fundamental conflict between the unconstrained discriminative alignment and generative foundational capabilities, ultimately leading to severe performance degradation (e.g., SimPO suffers a significant drop in reasoning accuracy from 73.5\\% to 37.5\\%). We analyze existing BT-style methods from the probability evolution perspective and theoretically prove that these methods exhibit over-reliance on model initialization and can lead to preference shift. To resolve these counter-intuitive behaviors, we propose a theoretically grounded Stable Preference Optimization (SPO) framework that constrains preference learning within a safe alignment region. Empirical evaluations demonstrate that SPO effectively stabilizes and enhances the performance of existing BT-style preference learning methods. SPO provides new insights into the design of preference learning objectives and opens up new avenues towards more reliable and interpretable language model alignment.","short_abstract":"Direct Preference Learning has emerged as a dominant offline paradigm for preference optimization. Most of these methods are based on the Bradley-Terry (BT) model for pairwise preference ranking, which directly aligns language model with human preference. Prior work has observed a counter-intuitive phenomenon termed li...","url_abs":"https://arxiv.org/abs/2507.07723","url_pdf":"https://arxiv.org/pdf/2507.07723v2","authors":"[\"Chengtao Jian\",\"Kai Yang\",\"Tianhao Gao\",\"Wuguang Ni\",\"Keying Yang\",\"Bowen Xiao\",\"Jiajun Liu\",\"Ye Ouyang\"]","published":"2025-07-10T12:57:39Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
