{"ID":2898384,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.03526","arxiv_id":"2507.03526","title":"Decoupled Relative Learning Rate Schedules","abstract":"In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our introduced relative learning rates, RLRS, method accelerates the training process by up to $23\\%$, particularly in complex models such as Mixture of Experts (MoE). Hyperparameters of RLRS can be efficiently tuned on smaller models and then effectively reused on models up to $27\\times$ larger. This simple and effective method results in a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.","short_abstract":"In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our i...","url_abs":"https://arxiv.org/abs/2507.03526","url_pdf":"https://arxiv.org/pdf/2507.03526v1","authors":"[\"Jan Ludziejewski\",\"Jan Małaśnicki\",\"Maciej Pióro\",\"Michał Krutul\",\"Kamil Ciebiera\",\"Maciej Stefaniak\",\"Jakub Krajewski\",\"Piotr Sankowski\",\"Marek Cygan\",\"Kamil Adamczewski\",\"Sebastian Jaszczur\"]","published":"2025-07-04T12:23:45Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Mixture of Experts\",\"Transformer\",\"Large Language Model\"]","has_code":false}
