{"ID":2906868,"CreatedAt":"2026-06-01T11:42:32.101213702Z","UpdatedAt":"2026-06-07T06:37:52.911886358Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2605.29398","arxiv_id":"2605.29398","title":"GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models","abstract":"Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce bias through training--inference mismatch by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with a more stable training reward dynamics, achieving test-accuracy improvements of up to $+19.6\\%$. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. \nCode is available at https://github.com/GaryBall/GDSD.","short_abstract":"Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from...","url_abs":"https://arxiv.org/abs/2605.29398","url_pdf":"https://arxiv.org/pdf/2605.29398v1","authors":"[\"Xiaohang Tang\",\"Keyue Jiang\",\"Che Liu\",\"Qifang Zhao\",\"Xiaoxiao Xu\",\"Sangwoong Yoon\",\"Ilija Bogunovic\"]","published":"2026-05-28T05:47:40Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":612549,"CreatedAt":"2026-06-01T11:42:32.101213702Z","UpdatedAt":"2026-06-01T11:42:32.101213702Z","DeletedAt":null,"paper_id":2906868,"paper_url":"https://arxiv.org/abs/2605.29398","paper_title":"GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models","repo_url":"https://github.com/GaryBall/GDSD","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
