{"ID":2832123,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.07011","arxiv_id":"2512.07011","title":"Block Sparse Flash Attention","abstract":"Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for FlashAttention. On Llama-3.1-8B, BSFA achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x for needle-in-a-haystack retrieval tasks while maintaining above 99% baseline accuracy, with certain configurations even improving accuracy by focusing on the most relevant content, substantially outperforming existing sparse attention methods. The implementation is available at https://github.com/Danielohayon/Block-Sparse-Flash-Attention","short_abstract":"Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. \nWe present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model qual...","url_abs":"https://arxiv.org/abs/2512.07011","url_pdf":"https://arxiv.org/pdf/2512.07011v1","authors":"[\"Daniel Ohayon\",\"Itay Lamprecht\",\"Itay Hubara\",\"Israel Cohen\",\"Daniel Soudry\",\"Noam Elata\"]","published":"2025-12-07T21:20:12Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\",\"cs.PF\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":606201,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2832123,"paper_url":"https://arxiv.org/abs/2512.07011","paper_title":"Block Sparse Flash Attention","repo_url":"https://github.com/Danielohayon/Block-Sparse-Flash-Attention","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
