{"ID":2881087,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.12837","arxiv_id":"2508.12837","title":"Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points","abstract":"Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \\leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub-$n$-grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of $n$-grams, characterized by discrete transitions between near-stationary solutions.","short_abstract":"Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish...","url_abs":"https://arxiv.org/abs/2508.12837","url_pdf":"https://arxiv.org/pdf/2508.12837v2","authors":"[\"Aditya Varre\",\"Gizem Yüce\",\"Nicolas Flammarion\"]","published":"2025-08-18T11:24:30Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}