FLARE: Fast Low-rank Attention Routing Engine

cs.LG arXiv:2508.12594
View PDF arXiv JSON

Abstract

The quadratic complexity of self-attention limits the scalability of transformers on long sequences. We introduce Fast Low-rank Attention Routing Engine (FLARE), a token-mixing operator that realizes low-rank attention by routing information through a small set of latent tokens. Each layer induces an input-input token mixing matrix of rank at most $M$ via a minimal encode-decode factorization implemented using only two standard scaled dot-product attention (SDPA) calls. Because the dominant ${O}(NM)$ computation is expressed purely in terms of standard SDPA, FLARE is compatible with fused attention kernels and avoids materializing $M\times N$ projection matrices. FLARE further assigns disjoint latent slices to each attention head, yielding a mixture of head-specific low-rank pathways. Empirically, FLARE scales to one-million-point unstructured meshes on a single GPU, achieves state-of-the-art accuracy on PDE surrogate benchmarks, and outperforms general-purpose efficient-attention methods on the Long Range Arena suite. We additionally release a large-scale additive manufacturing benchmark dataset. Our code is available at https://github.com/vpuri3/FLARE.py.

PDF Viewer