{"ID":2859133,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.05554","arxiv_id":"2510.05554","title":"Critical attention scaling in long-context transformers","abstract":"As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While $\\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $β_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $β_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $β_n \\asymp \\log n$ and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.","short_abstract":"As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. \nWhile $\\textit{attention scaling}$ effectively addresses this...","url_abs":"https://arxiv.org/abs/2510.05554","url_pdf":"https://arxiv.org/pdf/2510.05554v1","authors":"[\"Shi Chen\",\"Zhengjiang Lin\",\"Yury Polyanskiy\",\"Philippe Rigollet\"]","published":"2025-10-07T03:51:57Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.DM\",\"math.CA\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}