{"ID":2860189,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04019","arxiv_id":"2510.04019","title":"Simple Policy Gradients for Reasoning with Diffusion Language Models","abstract":"Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on different math and reasoning tasks, achieving +9.9\\% absolute gain on GSM8K, +4.6\\% on MATH-500, +59.4\\% on Countdown, and +69.7\\% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.","short_abstract":"Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement...","url_abs":"https://arxiv.org/abs/2510.04019","url_pdf":"https://arxiv.org/pdf/2510.04019v2","authors":"[\"Anthony Zhan\"]","published":"2025-10-05T03:53:16Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false}
