{"ID":2896190,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.07855","arxiv_id":"2507.07855","title":"DPO Unchained: Your Training Algorithm is Secretly Disentangled in Human Choice Theory","abstract":"Normative theories allow one to elicit key parts of a ML algorithm from first principles, which is crucial at a time of championed scrutiny for ML work. Direct Preference Optimization (DPO) cleverly bypasses reward modeling by making an explicit link with a specific normative model of human choice. Our paper elevates this connection to the full generality of DPO's normative framework. Getting there requires reworking human choice theory's textbook path for a better RLHF/ML fit. It elevates the connection to a remarkably broad viewpoint on preference optimization, considering the current panorama of DPO follow-ups. It also unveils unexpected riches for ML, chief among which the support for non-convex losses, the fact that any compliant ML analytical choice can be embedded with any human choice model, and a normative framework's umbrella wide enough to safeguard DPO's extensions (margins, length correction, ...). A toy experiment ``far away'' from the DPO crowd is given.","short_abstract":"Normative theories allow one to elicit key parts of a ML algorithm from first principles, which is crucial at a time of championed scrutiny for ML work. Direct Preference Optimization (DPO) cleverly bypasses reward modeling by making an explicit link with a specific normative model of human choice. Our paper elevates t...","url_abs":"https://arxiv.org/abs/2507.07855","url_pdf":"https://arxiv.org/pdf/2507.07855v3","authors":"[\"Wenxuan Zhou\",\"Shujian Zhang\",\"Brice Magdalou\",\"John Lambert\",\"Ehsan Amid\",\"Richard Nock\",\"Andrew Hard\"]","published":"2025-07-10T15:38:17Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"RLHF\"]","has_code":false}
