{"ID":2827769,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.17073","arxiv_id":"2512.17073","title":"Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation","abstract":"Mixture-of-Experts (MoE) models scale capacity via sparse activation but stress memory and bandwidth. Offloading alleviates GPU memory by fetching experts on demand, yet token-level routing causes irregular transfers that make inference I/O-bound. Static uniform quantization reduces traffic but degrades accuracy under aggressive compression by ignoring expert heterogeneity. We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation, which performs router-guided precision restoration using precomputed low-rank compensators. At inference time, our method transfers compact low-rank factors with Top-n (n\u003ck) experts per token and applies compensation to them, keeping others low-bit. Integrated with offloading on GPU and GPU-NDP systems, our method delivers a superior bandwidth-accuracy trade-off and improved throughput.","short_abstract":"Mixture-of-Experts (MoE) models scale capacity via sparse activation but stress memory and bandwidth. Offloading alleviates GPU memory by fetching experts on demand, yet token-level routing causes irregular transfers that make inference I/O-bound. Static uniform quantization reduces traffic but degrades accuracy under...","url_abs":"https://arxiv.org/abs/2512.17073","url_pdf":"https://arxiv.org/pdf/2512.17073v1","authors":"[\"Zhenyu Liu\",\"Yunzhen Liu\",\"Zehao Fan\",\"Garrett Gagnon\",\"Yayue Hou\",\"Nan Wu\",\"Yangwook Kang\",\"Liu Liu\"]","published":"2025-12-18T21:15:54Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[]","has_code":false}
