{"ID":3050385,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-05T07:50:16.0004273Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04048","arxiv_id":"2606.04048","title":"Unlocking Feature Learning in Gated Delta Networks at Scale","abstract":"Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($μ$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.","short_abstract":"Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($μ$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension t...","url_abs":"https://arxiv.org/abs/2606.04048","url_pdf":"https://arxiv.org/pdf/2606.04048v1","authors":"[\"Yifeng Liu\",\"Quanquan Gu\"]","published":"2026-06-02T08:45:24Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
