{"ID":2843354,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.08158","arxiv_id":"2511.08158","title":"LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures","abstract":"Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in both scientific computing and emerging graph learning workloads. The recent Armv9 architecture introduces Scalable Matrix Extension (SME), enabling tile-based matrix operations with high throughput. However, effectively exploiting both SME and traditional SIMD resources for unstructured sparse workloads remains an open challenge. To address this, we propose LOOPS, a hybrid execution framework that combines row-wise CSR-part with vector-wise BCSR-part layout, enabling cooperative utilization of vector instructions (NEON) and Scalable Matrix Extension (SME) resources. LOOPS supports multi-precision SpMM across FP64, FP32, and FP16 via an adaptive two-level parallelization scheme guided by a lightweight performance model. Experimental results on the entire SuiteSparse on an Apple's M4Pro CPU show that LOOPS achieves average speedups of 9.93$\\times$ (FP32)/14.4$\\times$ (FP64) against the CPU baseline TACO and 71.3$\\times$ (FP32)/54.8$\\times$ (FP64) with respect to Armadillo. A comparison of LOOPS running on the same CPU with two GPU methods (cuSPARSE, Magicube) executed on an NVIDIA A100 GPU show average speedups for LOOPS between 19.8$\\times$ and 33.5$\\times$, depending on the precision. Notably, LOOPS delivers significantly better energy efficiency than the GPU codes on the A100 GPU.","short_abstract":"Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in both scientific computing and emerging graph learning workloads. The recent Armv9 architecture introduces Scalable Matrix Extension (SME), enabling tile-based matrix operations with high throughput. However, effectively exploiting both SME and tra...","url_abs":"https://arxiv.org/abs/2511.08158","url_pdf":"https://arxiv.org/pdf/2511.08158v2","authors":"[\"Kelun Lei\",\"Hailong Yang\",\"Kaige Zhang\",\"Kejie Ma\",\"Yiqing Wang\",\"Xin You\",\"Yufan Xu\",\"Enrique S. Quintana-Orti\",\"Zhongzhi Luan\",\"Yi Liu\",\"Depei Qian\"]","published":"2025-11-11T12:10:41Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[]","has_code":false}
