{"ID":3053180,"CreatedAt":"2026-06-04T04:41:36.695875263Z","UpdatedAt":"2026-06-05T11:43:53.432517148Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04101","arxiv_id":"2606.04101","title":"UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing","abstract":"Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Built upon the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and layer on critical paths, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering 1.49$\\times$ improvement over non-balancing, while reducing the final inter-rank imbalance from 1.30$-$4.01 to 1.01$-$1.04. Additionally, we validate UltraEP's scalability and robustness in production MoE training with 2560 GPUs.","short_abstract":"Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical loa...","url_abs":"https://arxiv.org/abs/2606.04101","url_pdf":"https://arxiv.org/pdf/2606.04101v1","authors":"[\"Xinming Wei\",\"Chao Jin\",\"Tuo Dai\",\"Yinmin Zhong\",\"Shan Yu\",\"Chengxu Yang\",\"Bingyang Wu\",\"Zili Zhang\",\"Jing Mai\",\"Qianchao Zhu\",\"Zhouyang Li\",\"Yuliang Liu\",\"Guojie Luo\"]","published":"2026-06-02T18:07:51Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.LG\"]","methods":"[]","has_code":false}