{"ID":2858180,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08233","arxiv_id":"2510.08233","title":"Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization","abstract":"Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to $54.3\\%$ over previously SOTA baselines and $66.41\\%$ over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at https://github.com/yuchen-zhu-zyc/DMPO.","short_abstract":"Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning....","url_abs":"https://arxiv.org/abs/2510.08233","url_pdf":"https://arxiv.org/pdf/2510.08233v2","authors":"[\"Yuchen Zhu\",\"Wei Guo\",\"Jaemoo Choi\",\"Petr Molodyk\",\"Bo Yuan\",\"Molei Tao\",\"Yongxin Chen\"]","published":"2025-10-09T13:59:50Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608522,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2858180,"paper_url":"https://arxiv.org/abs/2510.08233","paper_title":"Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization","repo_url":"https://github.com/yuchen-zhu-zyc/DMPO","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}