{"ID":2894285,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.11067","arxiv_id":"2507.11067","title":"MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit","abstract":"Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based acceleration strategies and tailor an optimal approach for 3D high-order stencils. We introduce algorithmic optimizations based on SIMD and matrix units to address strided memory accesses, alignment conflicts, and redundant accesses. We propose memory optimizations to boost on-package memory efficiency, and a novel multi-thread parallelism paradigm to overcome data-sharing challenges caused by the absence of shared data caches. MMStencil sustains consistently high hardware utilization across diverse stencil shapes and dimensions. Our DMA-based inter-NUMA communication further mitigates NUMA effects and MPI limitations in hybrid parallelism. Combining all the innovations, MMStencil outperforms state-of-the-art libraries on Nvidia A100 GPGPU by up to 2.1x. Moreover, the performance improvements translate directly to real-world HPC applications and enable RTM applications to yield 1.8x speedup versus a highly optimized industrial Nvidia A100 GPGPU version.","short_abstract":"Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based acceleration strategies and tailor an optimal approach for 3D high-order stencils...","url_abs":"https://arxiv.org/abs/2507.11067","url_pdf":"https://arxiv.org/pdf/2507.11067v1","authors":"[\"Yinuo Wang\",\"Tianqi Mao\",\"Lin Gan\",\"Wubing Wan\",\"Zeyu Song\",\"Jiayu Fu\",\"Lanke He\",\"Wenqiang Wang\",\"Zekun Yin\",\"Wei Xue\",\"Guangwen Yang\"]","published":"2025-07-15T08:00:11Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[]","has_code":false}