{"ID":2834815,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.00743","arxiv_id":"2512.00743","title":"Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards","abstract":"Recently, Group Relative Policy Optimization (GRPO) has shown promising potential for aligning text-to-image (T2I) models, yet existing GRPO-based methods suffer from two critical limitations. (1) \\textit{Shared credit assignment}: trajectory-level advantages derived from group-normalized sparse terminal rewards are uniformly applied across timesteps, failing to accurately estimate the potential of early denoising steps with vast exploration spaces. (2) \\textit{Reward-mixing}: predefined weights for combining multi-objective rewards (e.g., text accuracy, visual quality, text color)--which have mismatched scales and variances--lead to unstable gradients and conflicting updates. To address these issues, we propose \\textbf{Multi-GRPO}, a multi-group advantage estimation framework with two orthogonal grouping mechanisms. For better credit assignment, we introduce tree-based trajectories inspired by Monte Carlo Tree Search: branching trajectories at selected early denoising steps naturally forms \\emph{temporal groups}, enabling accurate advantage estimation for early steps via descendant leaves while amortizing computation through shared prefixes. For multi-objective optimization, we introduce \\emph{reward-based grouping} to compute advantages for each reward function \\textit{independently} before aggregation, disentangling conflicting signals. To facilitate evaluation of multiple objective alignment, we curate \\textit{OCR-Color-10}, a visual text rendering dataset with explicit color constraints. Across the single-reward \\textit{PickScore-25k} and multi-objective \\textit{OCR-Color-10} benchmarks, Multi-GRPO achieves superior stability and alignment performance, effectively balancing conflicting objectives. Code will be publicly available at \\href{https://github.com/fikry102/Multi-GRPO}{https://github.com/fikry102/Multi-GRPO}.","short_abstract":"Recently, Group Relative Policy Optimization (GRPO) has shown promising potential for aligning text-to-image (T2I) models, yet existing GRPO-based methods suffer from two critical limitations. (1) \\textit{Shared credit assignment}: trajectory-level advantages derived from group-normalized sparse terminal rewards are un...","url_abs":"https://arxiv.org/abs/2512.00743","url_pdf":"https://arxiv.org/pdf/2512.00743v1","authors":"[\"Qiang Lyu\",\"Zicong Chen\",\"Chongxiao Wang\",\"Haolin Shi\",\"Shibo Gao\",\"Ran Piao\",\"Youwei Zeng\",\"Jianlou Si\",\"Fei Ding\",\"Jing Li\",\"Chun Pong Lau\",\"Weiqiang Wang\"]","published":"2025-11-30T05:44:35Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"LoRA\"]","has_code":false,"code_links":[{"ID":606448,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2834815,"paper_url":"https://arxiv.org/abs/2512.00743","paper_title":"Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards","repo_url":"https://github.com/fikry102/Multi-GRPO","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
