{"ID":2844332,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.06313","arxiv_id":"2511.06313","title":"Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration","abstract":"Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms to support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8x8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient control and data transfer to our optimized precision-scalable MX datapath. We evaluate our design both on MAC and system level and compare it to the SotA. Our integrated system achieves an energy efficiency of 657, 1438-1675, and 4065 GOPS/W, respectively, for MXINT8, MXFP8/6, and MXFP4, with a throughput of 64, 256, and 512 GOPS.","short_abstract":"Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms to support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate...","url_abs":"https://arxiv.org/abs/2511.06313","url_pdf":"https://arxiv.org/pdf/2511.06313v1","authors":"[\"Stef Cuyckens\",\"Xiaoling Yi\",\"Robin Geens\",\"Joren Dumoulin\",\"Martin Wiesner\",\"Chao Fang\",\"Marian Verhelst\"]","published":"2025-11-09T10:24:17Z","proceeding":"cs.AR","tasks":"[\"cs.AR\",\"cs.AI\",\"cs.LG\",\"eess.SP\"]","methods":"[]","has_code":false}
