{"ID":2891420,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.17634","arxiv_id":"2507.17634","title":"WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training","abstract":"Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies-including cosine decay, linear decay and inverse square root decay-as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration-the training window for checkpoint aggregation-as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.","short_abstract":"Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. \nWe present Warmup-Stable and Merge...","url_abs":"https://arxiv.org/abs/2507.17634","url_pdf":"https://arxiv.org/pdf/2507.17634v2","authors":"[\"Changxin Tian\",\"Jiapeng Wang\",\"Qian Zhao\",\"Kunlong Chen\",\"Jia Liu\",\"Ziqi Liu\",\"Jiaxin Mao\",\"Wayne Xin Zhao\",\"Zhiqiang Zhang\",\"Jun Zhou\"]","published":"2025-07-23T16:02:06Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}
