{"ID":2858748,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.06957","arxiv_id":"2510.06957","title":"Accelerating Sparse Ternary GEMM for Quantized ML on Apple Silicon","abstract":"Sparse Ternary General Matrix-Matrix Multiplication (GEMM) remains under-optimized in existing libraries for Apple Silicon CPUs. We present a Sparse Ternary GEMM kernel optimized specifically for Apple's M-series processors. We propose a set of architecture-aware optimizations, including a novel blocked and interleaved sparse data format to improve memory locality, strategies to increase Instruction-Level Parallelism (ILP), and NEON-based Single Instruction Multiple Data (SIMD) vectorization to exploit data-level parallelism. Our scalar implementation achieves up to a 5.98x performance increase over a traditional Ternary Compressed Sparse Column (TCSC) baseline for large matrices with 50% ternary nonzero values (sparsity), reaching up to 50.2% of the processor's theoretical peak performance, and remains stable across varying sparsity levels. Our vectorized implementation delivers up to a 5.59x performance increase for large matrices with 25% sparsity, and remains stable across varying sparsity levels.","short_abstract":"Sparse Ternary General Matrix-Matrix Multiplication (GEMM) remains under-optimized in existing libraries for Apple Silicon CPUs. We present a Sparse Ternary GEMM kernel optimized specifically for Apple's M-series processors. We propose a set of architecture-aware optimizations, including a novel blocked and interleaved...","url_abs":"https://arxiv.org/abs/2510.06957","url_pdf":"https://arxiv.org/pdf/2510.06957v2","authors":"[\"Baraq Lipshitz\",\"Alessio Melone\",\"Charalampos Maraziaris\",\"Muhammed Bilal\"]","published":"2025-10-08T12:42:07Z","proceeding":"cs.PF","tasks":"[\"cs.PF\",\"cs.LG\"]","methods":"[]","has_code":false}
