{"ID":2850544,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21270","arxiv_id":"2510.21270","title":"Sparser Block-Sparse Attention via Token Permutation","abstract":"Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (\\textbf{PBS-Attn}), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn","short_abstract":"Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the att...","url_abs":"https://arxiv.org/abs/2510.21270","url_pdf":"https://arxiv.org/pdf/2510.21270v2","authors":"[\"Xinghao Wang\",\"Pengyu Wang\",\"Dong Zhang\",\"Chenkun Tan\",\"Shaojun Zhou\",\"Zhaoxiang Liu\",\"Shiguo Lian\",\"Fangxu Liu\",\"Kai Song\",\"Xipeng Qiu\"]","published":"2025-10-24T09:11:50Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607810,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2850544,"paper_url":"https://arxiv.org/abs/2510.21270","paper_title":"Sparser Block-Sparse Attention via Token Permutation","repo_url":"https://github.com/xinghaow99/pbs-attn","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
