{"ID":2868610,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25534","arxiv_id":"2509.25534","title":"Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning","abstract":"Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster and more resource-efficient training while surpassing baselines. Remarkably, on Qwen3-32B, training with just the 4000-sample HealthBench Easy subset is sufficient to obtain a model that exceeds GPT-5 on HealthBench Hard. Incorporating a small amount of teacher-graded data further enhances performance for less capable models.","short_abstract":"Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader...","url_abs":"https://arxiv.org/abs/2509.25534","url_pdf":"https://arxiv.org/pdf/2509.25534v1","authors":"[\"Zhiling Ye\",\"Yun Yue\",\"Haowen Wang\",\"Xudong Han\",\"Jiadi Jiang\",\"Cheng Wei\",\"Lei Fan\",\"Jiaxin Liang\",\"Shuowen Zhang\",\"Ji Li\",\"Chunxiao Guo\",\"Jian Wang\",\"Peng Wei\",\"Jinjie Gu\"]","published":"2025-09-19T05:08:55Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
