{"ID":2865972,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21070","arxiv_id":"2509.21070","title":"ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning","abstract":"Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API cost, complexity of prompting, and limited difficulty level of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between \"Thinking\" and \"NoThinking\" modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems in large scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.","short_abstract":"Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open...","url_abs":"https://arxiv.org/abs/2509.21070","url_pdf":"https://arxiv.org/pdf/2509.21070v1","authors":"[\"Qizhi Pei\",\"Zhuoshi Pan\",\"Honglin Lin\",\"Xin Gao\",\"Yu Li\",\"Zinan Tang\",\"Conghui He\",\"Rui Yan\",\"Lijun Wu\"]","published":"2025-09-25T12:22:44Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[]","has_code":false,"code_links":[{"ID":609325,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2865972,"paper_url":"https://arxiv.org/abs/2509.21070","paper_title":"ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning","repo_url":"https://github.com/QizhiPei/ScaleDiff","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
