{"ID":2865582,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22840","arxiv_id":"2509.22840","title":"A Capacity-Based Rationale for Multi-Head Attention","abstract":"We study the capacity of the self-attention key-query channel: for a fixed budget, how many distinct token-token relations can a single layer reliably encode? We introduce Relational Graph Recognition, where the key-query channel encodes a directed graph and, given a context (a subset of the vertices), must recover the neighbors of each vertex in the context. We measure resources by the total key dimension $D_K = h\\,d_k$. In a tractable multi-head model, we prove matching information-theoretic lower bounds and upper bounds via explicit constructions showing that recovering a graph with $m'$ relations in $d_{\\text{model}}$-dimensional embeddings requires $D_K$ to grow essentially as $m'/d_{\\text{model}}$ up to logarithmic factors, and we obtain corresponding guarantees for scaled-softmax attention. This analysis yields a new, capacity-based rationale for multi-head attention: even in permutation graphs, where all queries attend to a single target, splitting a fixed $D_K$ budget into multiple heads increases capacity by reducing interference from embedding superposition. Controlled experiments mirror the theory, revealing sharp phase transitions at the predicted capacity, and the multi-head advantage persists when adding softmax normalization, value routing, and a full Transformer block trained with frozen GPT-2 embeddings.","short_abstract":"We study the capacity of the self-attention key-query channel: for a fixed budget, how many distinct token-token relations can a single layer reliably encode? We introduce Relational Graph Recognition, where the key-query channel encodes a directed graph and, given a context (a subset of the vertices), must recover the...","url_abs":"https://arxiv.org/abs/2509.22840","url_pdf":"https://arxiv.org/pdf/2509.22840v3","authors":"[\"Micah Adler\"]","published":"2025-09-26T18:47:35Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\"]","has_code":false}
