{"ID":2880942,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.12594","arxiv_id":"2508.12594","title":"FLARE: Fast Low-rank Attention Routing Engine","abstract":"The quadratic complexity of self-attention limits the scalability of transformers on long sequences. We introduce Fast Low-rank Attention Routing Engine (FLARE), a token-mixing operator that realizes low-rank attention by routing information through a small set of latent tokens. Each layer induces an input-input token mixing matrix of rank at most $M$ via a minimal encode-decode factorization implemented using only two standard scaled dot-product attention (SDPA) calls. Because the dominant ${O}(NM)$ computation is expressed purely in terms of standard SDPA, FLARE is compatible with fused attention kernels and avoids materializing $M\\times N$ projection matrices. FLARE further assigns disjoint latent slices to each attention head, yielding a mixture of head-specific low-rank pathways. Empirically, FLARE scales to one-million-point unstructured meshes on a single GPU, achieves state-of-the-art accuracy on PDE surrogate benchmarks, and outperforms general-purpose efficient-attention methods on the Long Range Arena suite. We additionally release a large-scale additive manufacturing benchmark dataset. Our code is available at https://github.com/vpuri3/FLARE.py.","short_abstract":"The quadratic complexity of self-attention limits the scalability of transformers on long sequences. We introduce Fast Low-rank Attention Routing Engine (FLARE), a token-mixing operator that realizes low-rank attention by routing information through a small set of latent tokens. Each layer induces an input-input token...","url_abs":"https://arxiv.org/abs/2508.12594","url_pdf":"https://arxiv.org/pdf/2508.12594v3","authors":"[\"Vedant Puri\",\"Aditya Joglekar\",\"Sri Datta Ganesh Bandreddi\",\"Kevin Ferguson\",\"Yu-hsuan Chen\",\"Yongjie Jessica Zhang\",\"Levent Burak Kara\"]","published":"2025-08-18T03:00:55Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":610745,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2880942,"paper_url":"https://arxiv.org/abs/2508.12594","paper_title":"FLARE: Fast Low-rank Attention Routing Engine","repo_url":"https://github.com/vpuri3/FLARE.py","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
