{"ID":3049963,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-06T15:28:14.12845936Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05014","arxiv_id":"2606.05014","title":"Depth-Attention: Cross-Layer Value Mixing for Language Models","abstract":"Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference--a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache--the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.","short_abstract":"Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding...","url_abs":"https://arxiv.org/abs/2606.05014","url_pdf":"https://arxiv.org/pdf/2606.05014v1","authors":"[\"Boyi Zeng\",\"Yiqin Hao\",\"Zitong Wang\",\"Shixiang Song\",\"He Li\",\"Feichen Song\",\"Yifan Liu\",\"Ziwei He\",\"Xinbing Wang\",\"Zhouhan Lin\"]","published":"2026-06-03T15:33:45Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false}
