{"ID":2884983,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.16584","arxiv_id":"2508.16584","title":"TMA-Adaptive FP8 Grouped GEMM: Eliminating Padding Requirements in Low-Precision Training and Inference on Hopper","abstract":"Current FP8 grouped GEMM implementations require padding each group to a fixed alignment (e.g., 128), incurring memory and computational overhead. We propose \\textit{TMA-Adaptive FP8 Grouped GEMM}, which eliminates padding by dynamically adapting to variable group dimensions via (1) a TMA descriptor pool with $\\log_2(block_M)$ preconfigured descriptors to handle all residual row cases through dynamic runtime selection and dual-phase load-store operations, achieving comprehensive coverage with minimal overhead, and (2) TMA-alignment-aware management to satisfy 16-byte global memory alignment and 128-byte shared memory alignment. Experiments demonstrate 1.7\\% to 20.4\\% speed up with up to 23.8\\% memory reduction compared to padding operation plus state-of-the-art FP8 grouped GEMM, while maintaining full numerical equivalence for valid data. The source code is publicly available at an anonymous repository: https://github.com/sukoncon/TMA-Adaptive-FP8-Grouped-GEMM.","short_abstract":"Current FP8 grouped GEMM implementations require padding each group to a fixed alignment (e.g., 128), incurring memory and computational overhead. \nWe propose \\textit{TMA-Adaptive FP8 Grouped GEMM}, which eliminates padding by dynamically adapting to variable group dimensions via (1) a TMA descriptor pool with $\\log_2(b...","url_abs":"https://arxiv.org/abs/2508.16584","url_pdf":"https://arxiv.org/pdf/2508.16584v1","authors":"[\"Zhongling Su\",\"Rong Fu\",\"Weihan Cao\",\"Jianfei Gao\",\"Minxi Jin\",\"Zhilin Pei\",\"Hui Wang\"]","published":"2025-08-07T03:24:31Z","proceeding":"cs.AR","tasks":"[\"cs.AR\"]","methods":"[]","has_code":false,"code_links":[{"ID":611138,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2884983,"paper_url":"https://arxiv.org/abs/2508.16584","paper_title":"TMA-Adaptive FP8 Grouped GEMM: Eliminating Padding Requirements in Low-Precision Training and Inference on Hopper","repo_url":"https://github.com/sukoncon/TMA-Adaptive-FP8-Grouped-GEMM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}