{"ID":2823656,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.24693","arxiv_id":"2512.24693","title":"MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models","abstract":"Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn \\textit{training} techniques, effective automated \\textit{evaluation} specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning \\textit{multiple} turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose \\textbf{MU}lti-\\textbf{S}tep \\textbf{I}nstruction \\textbf{C}ontrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Leveraging MUSIC on the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, crucially, without compromising performance on standard single-turn RM benchmarks.","short_abstract":"Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While re...","url_abs":"https://arxiv.org/abs/2512.24693","url_pdf":"https://arxiv.org/pdf/2512.24693v1","authors":"[\"Wenzhe Li\",\"Shujian Zhang\",\"Wenxuan Zhou\",\"John Lambert\",\"Chi Jin\",\"Andrew Hard\",\"Rajiv Mathews\",\"Lun Wang\"]","published":"2025-12-31T07:54:30Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
