{"ID":2868749,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15816","arxiv_id":"2509.15816","title":"On the Convergence of Muon and Beyond","abstract":"The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap remains between its practical performance and theoretical understanding. Existing analyses show that the Muon variants achieve only a suboptimal ergodic convergence rate of $\\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To study the theoretical limits of Muon, we analyze two momentum-based variance-reduced variants: the one-batch Muon-MVR1 and the two-batch Muon-MVR2. We provide the first rigorous proof that, under \\textbf{horizon-free} learning-rate schedules, variance reduction enables Muon-MVR2 to attain the optimal anytime convergence rate $\\widetilde{\\mathcal{O}}(T^{-1/3})$, matching the lower bound for this problem class. Under the Polyak--Łojasiewicz (PL) condition, we establish anytime guarantees for Muon-MVR1 and Muon-MVR2: they attain best-iterate rates of $\\widetilde{\\mathcal{O}}(T^{-1/4})$ and $\\widetilde{\\mathcal{O}}(T^{-1/3})$ for the expected square-root suboptimality, and, given an additional uniform gradient bound along the iterates, achieve last-iterate rates of $\\mathcal{O}(T^{-1/4})$ and $\\mathcal{O}(T^{-1/3})$ for the objective gap, respectively. Experiments on CIFAR-10 and C4 support the practical effectiveness of the proposed variance-reduced Muon variants. Code is available at \\href{https://github.com/MaeChd/MUON-MVR}{Muon-MVR} Codebase.","short_abstract":"The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap remains between its practical performance and theoretical understanding. Existing analyses show that the Muon variants achieve only a suboptimal ergodic conv...","url_abs":"https://arxiv.org/abs/2509.15816","url_pdf":"https://arxiv.org/pdf/2509.15816v5","authors":"[\"Da Chang\",\"Yongxiang Liu\",\"Ganzhao Yuan\"]","published":"2025-09-19T09:43:37Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[]","has_code":false,"code_links":[{"ID":609616,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2868749,"paper_url":"https://arxiv.org/abs/2509.15816","paper_title":"On the Convergence of Muon and Beyond","repo_url":"https://github.com/MaeChd/MUON-MVR","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
