{"ID":2870957,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.11983","arxiv_id":"2509.11983","title":"Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training","abstract":"Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon \\citep{jordanmuon}, which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose \\textit{low-rank orthogonalization}, which performs orthogonalization by leveraging the low-rank nature of gradients during NN training. Building on this, we introduce low-rank matrix-signed gradient descent (MSGD) and a low-rank variant of Muon. Numerical experiments demonstrate the superior performance of low-rank orthogonalization, with low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining -- surpassing the carefully tuned vanilla Muon on tasks with large model sizes. Theoretically, we establish the iteration complexity of low-rank MSGD for finding an approximate stationary solution, and the iteration complexity of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise. The code to reproduce our numerical experiments is available at https://github.com/dengzhanwang/Low-rank-Muon.","short_abstract":"Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon \\citep{jordanmuon}, which explicitly exploits this structure, has gained significant attention for its strong performance in foundation m...","url_abs":"https://arxiv.org/abs/2509.11983","url_pdf":"https://arxiv.org/pdf/2509.11983v2","authors":"[\"Chuan He\",\"Zhanwang Deng\",\"Zhaosong Lu\"]","published":"2025-09-15T14:28:53Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"math.OC\"]","methods":"[]","has_code":false,"code_links":[{"ID":609814,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2870957,"paper_url":"https://arxiv.org/abs/2509.11983","paper_title":"Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training","repo_url":"https://github.com/dengzhanwang/Low-rank-Muon","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}