{"ID":2863695,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24914","arxiv_id":"2509.24914","title":"Single-Head Attention in High Dimensions: A Theory of Generalization, Weights Spectra, and Scaling Laws","abstract":"Trained attention layers exhibit striking and reproducible spectral structure of the weights, including low-rank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks generated from the attention-indexed model. Using tools from random matrix theory, spin-glass theory, and approximate message passing, we obtain an exact high-dimensional characterization of training and test error, interpolation and recovery thresholds, and the spectrum of the key and query matrices. Our theory predicts the full singular-value distribution of the trained query-key map, including low-rank structure and isolated spectral outliers, in qualitative agreement with observations in more realistic transformers. Finally, for targets with power-law spectra, we show that learning proceeds through sequential spectral recovery, leading to the emergence of power-law scaling laws.","short_abstract":"Trained attention layers exhibit striking and reproducible spectral structure of the weights, including low-rank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a sin...","url_abs":"https://arxiv.org/abs/2509.24914","url_pdf":"https://arxiv.org/pdf/2509.24914v2","authors":"[\"Fabrizio Boncoraglio\",\"Vittorio Erba\",\"Emanuele Troiani\",\"Yizhou Xu\",\"Florent Krzakala\",\"Lenka Zdeborová\"]","published":"2025-09-29T15:19:31Z","proceeding":"stat.ML","tasks":"[\"stat.ML\",\"cond-mat.dis-nn\",\"cs.IT\",\"cs.LG\"]","methods":"[\"Transformer\"]","has_code":false}
