{"ID":2830963,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.08153","arxiv_id":"2512.08153","title":"TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models","abstract":"Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce \\textbf{TreeGRPO}, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) \\emph{High sample efficiency}, achieving better performance under same training samples (2) \\emph{Fine-grained credit assignment} via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods, and (3) \\emph{Amortized computation} where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves \\textbf{2.4$\\times$ faster training} while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.","short_abstract":"Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce \\textbf{TreeGRPO}, a novel RL framework that dramatically improves training efficiency by recasting the denoisin...","url_abs":"https://arxiv.org/abs/2512.08153","url_pdf":"https://arxiv.org/pdf/2512.08153v1","authors":"[\"Zheng Ding\",\"Weirui Ye\"]","published":"2025-12-09T01:17:34Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\"]","has_code":false}
