{"ID":2825762,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.20198","arxiv_id":"2512.20198","title":"Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling","abstract":"Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2$\\times$ speedup and 71.2$\\times$ energy efficiency over A100, and surpassing SOTA accelerators by up to 16.1$\\times$ energy and 27.1$\\times$ area efficiency gains. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1$\\times$ throughput improvement.","short_abstract":"Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we...","url_abs":"https://arxiv.org/abs/2512.20198","url_pdf":"https://arxiv.org/pdf/2512.20198v2","authors":"[\"Huizheng Wang\",\"Taiquan Wei\",\"Hongbin Wang\",\"Zichuan Wang\",\"Xinru Tang\",\"Zhiheng Yue\",\"Shaojun Wei\",\"Yang Hu\",\"Shouyi Yin\"]","published":"2025-12-23T09:43:32Z","proceeding":"cs.AR","tasks":"[\"cs.AR\",\"eess.SP\"]","methods":"[\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false}