{"ID":3045290,"CreatedAt":"2026-06-04T00:20:49.549693789Z","UpdatedAt":"2026-06-06T23:53:19.85481169Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2605.29156","arxiv_id":"2605.29156","title":"RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains","abstract":"Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.","short_abstract":"Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused b...","url_abs":"https://arxiv.org/abs/2605.29156","url_pdf":"https://arxiv.org/pdf/2605.29156v1","authors":"[\"Haoxiang Jiang\",\"Zihan Dong\",\"Tianci Liu\",\"Wanying Wang\",\"Ran Xu\",\"Tony Yu\",\"Linjun Zhang\",\"Haoyu Wang\"]","published":"2026-05-27T22:46:25Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
