{"ID":2847865,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.26122","arxiv_id":"2510.26122","title":"Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking","abstract":"While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common \"one problem, one solution\" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. This homogenization not only limits sampling effectiveness but also restricts the exploration space for subsequent Reinforcement Learning (RL) stages. To address this, we propose a \"one problem, multiple solutions\" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. \nOur code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .","short_abstract":"While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common \"one problem, one solution\" (1P1S) training practice, which provides a single canonical answer and can pus...","url_abs":"https://arxiv.org/abs/2510.26122","url_pdf":"https://arxiv.org/pdf/2510.26122v2","authors":"[\"Feng Ju\",\"Zeyu Qin\",\"Rui Min\",\"Zhitao He\",\"Lingpeng Kong\",\"Yi R. Fung\"]","published":"2025-10-30T04:08:53Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":607568,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2847865,"paper_url":"https://arxiv.org/abs/2510.26122","paper_title":"Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking","repo_url":"https://github.com/fengjujf/Reasoning-Path-Divergence","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
