{"ID":2882775,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09752","arxiv_id":"2508.09752","title":"$μ$-Parametrization for Mixture of Experts","abstract":"Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture in extremely large models. Currently, the largest open-source models reach over $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. Precisely for this reason, $μ$Transfer is becoming a key technique. It allows for seamless transfer of optimal hyperparameters across model scales, resulting in a huge reduction in tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $μ$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.","short_abstract":"Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture in extremely large models. Currently, the largest open-source models reach over $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. Precisely for this reas...","url_abs":"https://arxiv.org/abs/2508.09752","url_pdf":"https://arxiv.org/pdf/2508.09752v2","authors":"[\"Jan Małaśnicki\",\"Kamil Ciebiera\",\"Mateusz Boruń\",\"Maciej Pióro\",\"Jan Ludziejewski\",\"Maciej Stefaniak\",\"Michał Krutul\",\"Sebastian Jaszczur\",\"Marek Cygan\",\"Kamil Adamczewski\",\"Jakub Krajewski\"]","published":"2025-08-13T12:31:27Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Mixture of Experts\",\"Large Language Model\"]","has_code":false}