{"ID":2870280,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.12817","arxiv_id":"2509.12817","title":"SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention","abstract":"While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\\mathcal{O}(N^2)$ to $\\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression results in low-rank $KV$ feature maps, contributing to a performance gap compared to softmax attention. To mitigate this limitation, we propose \\textbf{S}elective \\textbf{A}daptive \\textbf{GA}ting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\\times$ improvement in throughput and a 2.69$\\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \\times 1280$. 
Moreover, it improves top-1 accuracy by up to 4.4\\% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.","short_abstract":"While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternati...","url_abs":"https://arxiv.org/abs/2509.12817","url_pdf":"https://arxiv.org/pdf/2509.12817v2","authors":"[\"Yuan Cao\",\"Dong Wang\"]","published":"2025-09-16T08:36:05Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false}
