{"ID":2833741,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.04332","arxiv_id":"2512.04332","title":"Data-regularized Reinforcement Learning for Diffusion Models at Scale","abstract":"Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.","short_abstract":"Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inhe...","url_abs":"https://arxiv.org/abs/2512.04332","url_pdf":"https://arxiv.org/pdf/2512.04332v3","authors":"[\"Haotian Ye\",\"Kaiwen Zheng\",\"Jiashu Xu\",\"Puheng Li\",\"Huayu Chen\",\"Jiaqi Han\",\"Sheng Liu\",\"Qinsheng Zhang\",\"Hanzi Mao\",\"Zekun Hao\",\"Prithvijit Chattopadhyay\",\"Dinghao Yang\",\"Liang Feng\",\"Maosheng Liao\",\"Junjie Bai\",\"Ming-Yu Liu\",\"James Zou\",\"Stefano Ermon\"]","published":"2025-12-03T23:45:07Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\"]","has_code":false}
