{"ID":2886743,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.02128","arxiv_id":"2508.02128","title":"Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models","abstract":"In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique critical for accelerating inference. While prior work has primarily focused on weight sparsity, it often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-dependent and faces challenges in generalization. To address these limitations, we introduce Amber Pruner, a training-free N:M activation sparsity method designed specifically for the prefill stage, targeting the acceleration of linear projection layers in LLMs. Extensive experiments across multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber Pruner can effectively sparsify and accelerate more than 55% of linear computations without requiring model retraining. To further enhance generality and efficiency, we propose Outstanding-sparse, a unified framework that integrates Amber Pruner with post-training W8A8 quantization. Our approach preserves strong performance across a range of downstream tasks, with notable advantages in generative tasks. This work pioneers a new frontier in activation sparsity, providing foundational insights that are poised to guide the co-evolution of algorithms and architectures in the design of next-generation AI systems.","short_abstract":"In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique critical for accelerating inference. While prior work has primarily focused on weight sparsity, it often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-...","url_abs":"https://arxiv.org/abs/2508.02128","url_pdf":"https://arxiv.org/pdf/2508.02128v1","authors":"[\"Tai An\",\"Ruwu Cai\",\"Yanzhe Zhang\",\"Yang Liu\",\"Hao Chen\",\"Pengcheng Xie\",\"Sheng Chang\",\"Yiwu Yao\",\"Gongyi Wang\"]","published":"2025-08-04T07:22:36Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
