{"ID":2863777,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25041","arxiv_id":"2509.25041","title":"GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference","abstract":"Sparse Mixture of Experts (SMoE) enables scalable parameter growth in large language models (LLMs) by selectively activating a subset of experts, and its large parameter count necessitates distributed deployment for inference. However, distributed inference faces a critical dilemma: although communication overhead constitutes the primary bottleneck, reducing it often exacerbates computational load imbalance, leading to resource waste. In this paper, we present GRACE-MoE, which stands for Grouping and Replication with Locality-Aware Routing for SMoE inference. GRACE-MoE is a lossless co-optimization framework that integrates expert grouping to reduce communication and dynamic replication to correct load skew, together with locality-aware routing to resolve replica selection. To underpin this coordinated optimization in multi-node settings, GRACE-MoE adopts a hierarchical sparse communication design that reduces cross-node traffic while implicitly aligning execution across nodes, thereby mitigating synchronization overhead. Experiments on diverse models and multi-node, multi-GPU environments demonstrate that GRACE-MoE efficiently reduces end-to-end inference latency, achieving up to 4.66x speedup over existing systems, and the code will be released upon acceptance.","short_abstract":"Sparse Mixture of Experts (SMoE) enables scalable parameter growth in large language models (LLMs) by selectively activating a subset of experts, and its large parameter count necessitates distributed deployment for inference. \nHowever, distributed inference faces a critical dilemma: although communication overhead cons...","url_abs":"https://arxiv.org/abs/2509.25041","url_pdf":"https://arxiv.org/pdf/2509.25041v4","authors":"[\"Yu Han\",\"Lehan Pan\",\"Jie Peng\",\"Ziyang Tao\",\"Hanqi Zhu\",\"Wuyang Zhang\",\"Yanyong Zhang\"]","published":"2025-09-29T16:57:33Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Mixture of Experts\",\"Large Language Model\",\"Language Model\"]","has_code":false}