{"ID":2853846,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.15242","arxiv_id":"2510.15242","title":"Bradley-Terry Policy Optimization for Generative Preference Modeling","abstract":"Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models for tasks with verifiable answers. However, extending RL-based thought training to more general non-verifiable tasks, where supervision is provided only through pairwise human preferences, remains challenging. Existing approaches typically apply RL objectives designed for verifiable rewards to preference-based settings in a heuristic manner. In this work, we show that introducing CoT reasoning into preference modeling fundamentally changes the structure of the Bradley-Terry (BT) likelihood, as the reasoning process must be treated as a latent variable. This results in a preference likelihood expressed as a ratio of expectations over stochastic generation trajectories, which cannot be optimized using Jensen-style bounds or standard RL objectives. To address this challenge, we derive a consistent Monte Carlo estimator for the gradient of the resulting likelihood, leading to Bradley-Terry Policy Optimization (BTPO). Empirically, BTPO enables stable and effective training of generative preference models with CoT reasoning, consistently outperforming prior heuristic approaches across multiple benchmarks and model scales.","short_abstract":"Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models for tasks with verifiable answers. However, extending RL-based thought training to more general non-verifiable tasks, where supervision is provided only through pairwise human preferences, remain...","url_abs":"https://arxiv.org/abs/2510.15242","url_pdf":"https://arxiv.org/pdf/2510.15242v3","authors":"[\"Shengyu Feng\",\"Yun He\",\"Shuang Ma\",\"Beibin Li\",\"Yuanhao Xiong\",\"Songlin Li\",\"Karishma Mandyam\",\"Julian Katz-Samuels\",\"Shengjie Bi\",\"Licheng Yu\",\"Hejia Zhang\",\"Karthik Abinav Sankararaman\",\"Han Fang\",\"Yiming Yang\",\"Manaal Faruqui\"]","published":"2025-10-17T02:14:24Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
