{"ID":2861464,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.01982","arxiv_id":"2510.01982","title":"Fine-Grained GRPO for Precise Preference Alignment in Flow Models","abstract":"The incorporation of online reinforcement learning (RL) into diffusion and flow-based generative models has recently gained attention as a powerful paradigm for aligning model behavior with human preferences. By leveraging stochastic sampling via Stochastic Differential Equations (SDEs) during the denoising phase, these models can explore a variety of denoising trajectories, enhancing the exploratory capacity of RL. However, despite their ability to discover potentially high-reward samples, current approaches often struggle to effectively align with preferences due to the sparsity and narrowness of reward feedback. To overcome this limitation, we introduce a novel framework called Granular-GRPO (G$^2$RPO), which enables fine-grained and comprehensive evaluation of sampling directions in the RL training of flow models. Specifically, we propose a Singular Stochastic Sampling mechanism that supports step-wise stochastic exploration while ensuring strong correlation between injected noise and reward signals, enabling more accurate credit assignment to each SDE perturbation. Additionally, to mitigate the bias introduced by fixed-granularity denoising, we design a Multi-Granularity Advantage Integration module that aggregates advantages computed across multiple diffusion scales, resulting in a more robust and holistic assessment of sampling trajectories. Extensive experiments on various reward models, including both in-domain and out-of-domain settings, demonstrate that our G$^2$RPO outperforms existing flow-based GRPO baselines, highlighting its effectiveness and generalization capability.","short_abstract":"The incorporation of online reinforcement learning (RL) into diffusion and flow-based generative models has recently gained attention as a powerful paradigm for aligning model behavior with human preferences. By leveraging stochastic sampling via Stochastic Differential Equations (SDEs) during the denoising phase, thes...","url_abs":"https://arxiv.org/abs/2510.01982","url_pdf":"https://arxiv.org/pdf/2510.01982v3","authors":"[\"Yujie Zhou\",\"Pengyang Ling\",\"Jiazi Bu\",\"Yibin Wang\",\"Yuhang Zang\",\"Jiaqi Wang\",\"Li Niu\",\"Guangtao Zhai\"]","published":"2025-10-02T12:57:12Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"LoRA\"]","has_code":false}
