{"ID":2869501,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15207","arxiv_id":"2509.15207","title":"FlowRL: Matching Reward Distributions for LLM Reasoning","abstract":"We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.","short_abstract":"We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting...","url_abs":"https://arxiv.org/abs/2509.15207","url_pdf":"https://arxiv.org/pdf/2509.15207v3","authors":"[\"Xuekai Zhu\",\"Daixuan Cheng\",\"Dinghuai Zhang\",\"Hengli Li\",\"Kaiyan Zhang\",\"Che Jiang\",\"Youbang Sun\",\"Ermo Hua\",\"Yuxin Zuo\",\"Xingtai Lv\",\"Qizheng Zhang\",\"Lin Chen\",\"Fanghao Shao\",\"Bo Xue\",\"Yunchong Song\",\"Zhenjie Yang\",\"Ganqu Cui\",\"Ning Ding\",\"Jianfeng Gao\",\"Xiaodong Liu\",\"Bowen Zhou\",\"Hongyuan Mei\",\"Zhouhan Lin\"]","published":"2025-09-18T17:56:36Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false}
