{"ID":2895497,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.09394","arxiv_id":"2507.09394","title":"A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention","abstract":"In this work, we study how multi-head latent attention (MLA), a popular strategy for compressing key/value memory, affects a transformer's internal capacity during pretraining. Using a lightweight suite of Marchenko-Pastur (MP) diagnostics, we analyze the spectrum of the $W_{Q}W_{K}^\\top$ gram matrix throughout training, comparing three variants: the standard multi-head attention (MHA) baseline, MLA-PreRoPE with rotary applied before compression, and MLA-Decoupled, which shares a single rotary sub-vector across all heads. Our random matrix analysis reveals \\textbf{three key findings:} \\textbf{ i)} capacity bottlenecks emerge locally: both MHA and MLA-PreRoPE exhibit sharp, early spikes in specific layers that persist and propagate, disrupting the balance between bulk and outlier directions; \\textbf{ ii)} these spikes coincide with rank collapse, concentrating the model's expressivity into narrow subspaces; \\textbf{ iii)} only the decoupled variant prevents this cascade, maintaining broad spectral support and suppressing outlier formation across layers. These results underscore that \\emph{how} rotary embeddings are applied is just as critical as \\emph{where} compression occurs. Sharing rotary components across heads mitigates spectral fragmentation and preserves representational capacity.","short_abstract":"In this work, we study how multi-head latent attention (MLA), a popular strategy for compressing key/value memory, affects a transformer's internal capacity during pretraining. Using a lightweight suite of Marchenko-Pastur (MP) diagnostics, we analyze the spectrum of the $W_{Q}W_{K}^\\top$ gram matrix throughout trainin...","url_abs":"https://arxiv.org/abs/2507.09394","url_pdf":"https://arxiv.org/pdf/2507.09394v1","authors":"[\"Nandan Kumar Jha\",\"Brandon Reagen\"]","published":"2025-07-12T20:31:07Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\"]","has_code":false}
