{"ID":2852894,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21800","arxiv_id":"2510.21800","title":"MARS-M: When Variance Reduction Meets Matrices","abstract":"Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). Recent benchmark studies of LLM pretraining optimizers have demonstrated that variance-reduction techniques such as MARS can substantially speed up training compared with standard optimizers that do not employ variance reduction. In this paper, we introduce MARS-M, a new optimizer that integrates MARS-style variance reduction with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\\tilde{\\mathcal{O}}(T^{-1/3})$, improving upon the $\\tilde{\\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.","short_abstract":"Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). Recent benchmark studies of LLM pretraining optimizers have demonstrated that variance-reduction techniques s...","url_abs":"https://arxiv.org/abs/2510.21800","url_pdf":"https://arxiv.org/pdf/2510.21800v3","authors":"[\"Yifeng Liu\",\"Angela Yuan\",\"Quanquan Gu\"]","published":"2025-10-20T16:49:22Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"math.OC\",\"stat.ML\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608040,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2852894,"paper_url":"https://arxiv.org/abs/2510.21800","paper_title":"MARS-M: When Variance Reduction Meets Matrices","repo_url":"https://github.com/AGI-Arena/MARS","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}