{"ID":2853860,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.15262","arxiv_id":"2510.15262","title":"Robust Layerwise Scaling Rules by Proper Weight Decay Tuning","abstract":"Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($μ$P) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading $μ$P transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\\sqrt{η/λ}$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\\sqrt{η/λ}\\cdot d^{0.75}$. Combining this observation with the $μ$P learning-rate rule $η_2\\propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $λ_2\\propto \\sqrt{d}$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $η_1=Θ_d(1)$ and $λ_1=0$, this yields \\emph{zero-shot} transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend $μ$P beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.","short_abstract":"Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($μ$P) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady sta...","url_abs":"https://arxiv.org/abs/2510.15262","url_pdf":"https://arxiv.org/pdf/2510.15262v1","authors":"[\"Zhiyuan Fan\",\"Yifeng Liu\",\"Qingyue Zhao\",\"Angela Yuan\",\"Quanquan Gu\"]","published":"2025-10-17T02:58:35Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"stat.ML\"]","methods":"[\"Transformer\"]","has_code":false}
