{"ID":2831794,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.07782","arxiv_id":"2512.07782","title":"GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory","abstract":"Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an \\textit{Associative Memory} interpretation, its difference-style update renders the training objective effectively \\emph{unbounded}. In contrast, Softmax attention normalizes updates, leading to \\emph{memory shrinkage and gradient vanishing}. We propose GatedFWA: a Memory-\\underline{Gated} (\\underline{F}lash) \\underline{W}indowed \\underline{A}ttention mechanism that preserves SWAs efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulate a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context, and it integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.","short_abstract":"Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an \\textit{Associative Memory} interpretation, its difference-s...","url_abs":"https://arxiv.org/abs/2512.07782","url_pdf":"https://arxiv.org/pdf/2512.07782v2","authors":"[\"Jiaxu Liu\",\"Yuhe Bai\",\"Xiangyu Yin\",\"Christos-Savvas Bouganis\"]","published":"2025-12-08T18:11:06Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
