{"ID":2875536,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.02492","arxiv_id":"2509.02492","title":"GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning","abstract":"Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.","short_abstract":"Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on ab...","url_abs":"https://arxiv.org/abs/2509.02492","url_pdf":"https://arxiv.org/pdf/2509.02492v3","authors":"[\"Chenglong Wang\",\"Yongyu Mu\",\"Hang Zhou\",\"Yifu Huo\",\"Ziming Zhu\",\"Jiali Zeng\",\"Murun Yang\",\"Bei Li\",\"Xiaoyang Hao\",\"Chunliang Zhang\",\"Fandong Meng\",\"Jingbo Zhu\",\"Tong Xiao\"]","published":"2025-09-02T16:41:07Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
