{"ID":2864421,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24006","arxiv_id":"2509.24006","title":"SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention","abstract":"In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B. The code is available at https://github.com/thu-ml/SLA.","short_abstract":"In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. \nWe find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with...","url_abs":"https://arxiv.org/abs/2509.24006","url_pdf":"https://arxiv.org/pdf/2509.24006v2","authors":"[\"Jintao Zhang\",\"Haoxu Wang\",\"Kai Jiang\",\"Shuo Yang\",\"Kaiwen Zheng\",\"Haocheng Xi\",\"Ziteng Wang\",\"Hongzhou Zhu\",\"Min Zhao\",\"Ion Stoica\",\"Joseph E. Gonzalez\",\"Jun Zhu\",\"Jianfei Chen\"]","published":"2025-09-28T17:58:59Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false,"code_links":[{"ID":609152,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2864421,"paper_url":"https://arxiv.org/abs/2509.24006","paper_title":"SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention","repo_url":"https://github.com/thu-ml/SLA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}