{"ID":2834686,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.02019","arxiv_id":"2512.02019","title":"Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning","abstract":"Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning (ME-RL) to diffusion processes, enabling sampling from the optimal policy trajectory distribution. By minimizing a tractable upper bound on the reverse KL divergence between the diffusion policy and the optimal policy trajectory distributions, we derive a modified surrogate objective and introduce Diffusion-Augmented Markov Decision Processes (DA-MDPs). DA-MDPs allow for seamless integration of diffusion policies into any ME-RL method with minimal modifications. We demonstrate its effectiveness by adapting Proximal Policy Optimization (PPO), Wasserstein Policy Optimization (WPO), and Relative Entropy Pathwise Policy Optimization (REPPO) into their diffusion-based variants: DA-MDP: PPO, DA-MDP: WPO, and DA-MDP: REPPO. Empirical results on standard continuous-control benchmarks show that our approach matches or outperforms baseline methods, while experiments on multimodal benchmarks confirm its ability to model multimodal action distributions.","short_abstract":"Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning (ME-RL) to diffusion processes, enabling sampling from the optimal policy trajectory distribution. By minimizing a tractable upper bound on the reverse KL divergence between the di...","url_abs":"https://arxiv.org/abs/2512.02019","url_pdf":"https://arxiv.org/pdf/2512.02019v3","authors":"[\"Sebastian Sanokowski\",\"Kaustubh Patil\"]","published":"2025-12-01T18:59:58Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"stat.ML\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\"]","has_code":false}
