{"ID":2862329,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.01450","arxiv_id":"2510.01450","title":"Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression","abstract":"Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight-even at greater computational cost-has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $Θ(n^2 d)$ and $Θ(n d^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall and state tracking tasks. Experiment results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models. Code is available at https://github.com/Yifei-Zuo/Flash-LLA.","short_abstract":"Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight-even at greater computational cost-has been relatively underexplored. In this work, we brid...","url_abs":"https://arxiv.org/abs/2510.01450","url_pdf":"https://arxiv.org/pdf/2510.01450v1","authors":"[\"Yifei Zuo\",\"Yutong Yin\",\"Zhichen Zeng\",\"Ang Li\",\"Banghua Zhu\",\"Zhaoran Wang\"]","published":"2025-10-01T20:42:21Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":608889,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2862329,"paper_url":"https://arxiv.org/abs/2510.01450","paper_title":"Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression","repo_url":"https://github.com/Yifei-Zuo/Flash-LLA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
