{"ID":2898903,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.03117","arxiv_id":"2507.03117","title":"BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers","abstract":"The energy consumption of large-scale ML models is dominated by data movement, shuffling billions of parameters across memory hierarchies and data centers. Sparsification offers a principled way to mitigate these costs by pruning redundant weights and activations, thereby reducing data movement. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable method for sparsification, applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss (majority \u003c2.25%). We show a 2.2x inference speedup for Llama 3.2 with 16 GPUs, and up to 4.45x reduction in inference memory footprint resulting in a 2.9x reduction in GPU setup and operating costs.","short_abstract":"The energy consumption of large-scale ML models is dominated by data movement, shuffling billions of parameters across memory hierarchies and data centers. Sparsification offers a principled way to mitigate these costs by pruning redundant weights and activations, thereby reducing data movement. \nEffective sparsificatio...","url_abs":"https://arxiv.org/abs/2507.03117","url_pdf":"https://arxiv.org/pdf/2507.03117v2","authors":"[\"Patrik Okanovic\",\"Sameer Deshmukh\",\"Grzegorz Kwasniewski\",\"Yi Zhu\",\"Haruto Fujii\",\"Sakina Fatima\",\"Maciej Besta\",\"Kentaro Katayama\",\"Takumi Honda\",\"Yusuke Nagasaka\",\"Torsten Hoefler\"]","published":"2025-07-03T18:53:54Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.DC\"]","methods":"[\"Transformer\"]","has_code":false}
