{"ID":2866141,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21474","arxiv_id":"2509.21474","title":"d2: Improved Techniques for Training Reasoning Diffusion Language Models","abstract":"While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on accurate estimates of the sampling trajectory likelihoods. Our likelihood estimator, d2-AnyOrder, achieves exact trajectory likelihood with a single model pass for DLMs that support a sampling algorithm called any-order decoding. Through an empirical study of widely used DLMs, we show that any-order decoding is not universally supported in practice. Consequently, for DLMs that do not naturally support any-order decoding, we propose another estimator, d2-StepMerge, which, unlike d2-AnyOrder, only approximates the trajectory likelihood. d2-StepMerge trades off compute for approximation accuracy in an analytically tractable manner. Empirically, d2 significantly outperforms widely-used RL baselines when applied to popular DLMs, and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500). We provide the code along with a blog post on the project page: https://guanghanwang.com/d2","short_abstract":"While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorit...","url_abs":"https://arxiv.org/abs/2509.21474","url_pdf":"https://arxiv.org/pdf/2509.21474v3","authors":"[\"Guanghan Wang\",\"Gilad Turok\",\"Yair Schiff\",\"Marianne Arriola\",\"Volodymyr Kuleshov\"]","published":"2025-09-25T19:40:36Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"Language Model\"]","project_urls":"[\"https://guanghanwang.com/d2\"]","has_code":false,"code_links":[{"ID":609344,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2866141,"paper_url":"https://arxiv.org/abs/2509.21474","paper_title":"d2: Improved Techniques for Training Reasoning Diffusion Language Models","repo_url":"https://github.com/kuleshov-group/d2","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":609345,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2866141,"paper_url":"https://arxiv.org/abs/2509.21474","paper_title":"d2: Improved Techniques for Training Reasoning Diffusion Language Models","repo_url":"https://github.com/eliahuhorwitz/Academic-project-page-template","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
