{"ID":2849785,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.23596","arxiv_id":"2510.23596","title":"Think Twice: Branch-and-Rethink Reasoning Reward Model","abstract":"Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains.","short_abstract":"Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a...","url_abs":"https://arxiv.org/abs/2510.23596","url_pdf":"https://arxiv.org/pdf/2510.23596v3","authors":"[\"Yizhu Jiao\",\"Jiaqi Zeng\",\"Julien Veron Vialard\",\"Oleksii Kuchaiev\",\"Jiawei Han\",\"Olivier Delalleau\"]","published":"2025-10-27T17:58:07Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"Large Language Model\",\"Language Model\",\"RLHF\"]","has_code":false}
