{"ID":2855710,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.12402","arxiv_id":"2510.12402","title":"Cautious Weight Decay","abstract":"We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.","short_abstract":"We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss a...","url_abs":"https://arxiv.org/abs/2510.12402","url_pdf":"https://arxiv.org/pdf/2510.12402v2","authors":"[\"Lizhang Chen\",\"Jonathan Li\",\"Kaizhao Liang\",\"Baiyu Su\",\"Cong Xie\",\"Nuo Wang Pierse\",\"Chen Liang\",\"Ni Lao\",\"Qiang Liu\"]","published":"2025-10-14T11:32:55Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"math.OC\",\"stat.ML\"]","methods":"[\"Language Model\"]","has_code":false}