{"ID":2864571,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.23106","arxiv_id":"2509.23106","title":"Effective Quantization of Muon Optimizer States","abstract":"The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and better computational efficiency over AdamW in LLM pre-training. However, the memory overhead of maintaining high-precision optimizer states remains a challenge for large-scale deployment. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization. In extensive Chinchilla-optimal experiments on pre-training models of up to 2.7B in size and fine-tuning them for instruction following, we demonstrate that 8-bit Muon achieves parity with Muon in terms of validation loss and downstream benchmarks, while achieving up to a 62\\% reduction in optimizer state footprint. Crucially, we show that Muon's update mechanism is uniquely compatible with a simple linear quantization scheme, bypassing the complex dynamic scaling required for quantized AdamW. We supplement our empirical findings with a theoretical analysis of Muon's robustness to quantization noise.","short_abstract":"The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and better computational efficiency over AdamW in LLM pre-training. However, the memory overhead of maintaining high-precision optimizer states remains a challenge for large-scale deployment. In this paper, we introduce the 8-b...","url_abs":"https://arxiv.org/abs/2509.23106","url_pdf":"https://arxiv.org/pdf/2509.23106v3","authors":"[\"Aman Gupta\",\"Rafael Celente\",\"Abhishek Shivanna\",\"D. T. Braithwaite\",\"Gregory Dexter\",\"Shao Tang\",\"Hiroto Udagawa\",\"Daniel Silva\",\"Rohan Ramanath\",\"S. Sathiya Keerthi\"]","published":"2025-09-27T04:31:11Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}
