{"ID":2851522,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.19262","arxiv_id":"2510.19262","title":"RailS: Load Balancing for All-to-All Communication in Distributed Mixture-of-Experts Training","abstract":"Training Mixture-of-Experts (MoE) models introduces sparse and highly imbalanced all-to-all communication that dominates iteration time. Conventional load-balancing methods fail to exploit the deterministic topology of Rail architectures, leaving multi-NIC bandwidth underutilized. We present RailS, a distributed load-balancing framework that minimizes all-to-all completion time in MoE training. RailS leverages the Rail topology's symmetry to prove that uniform sending ensures uniform receiving, transforming global coordination into local scheduling. Each node independently executes a Longest Processing Time First (LPT) spraying scheduler to proactively balance traffic using local information. RailS activates N parallel rails for fine-grained, topology-aware multipath transmission. Across synthetic and real-world MoE workloads, RailS improves bus bandwidth by 20%--78% and reduces completion time by 17%--78%. For Mixtral workloads, it shortens iteration time by 18%--40% and achieves near-optimal load balance, fully exploiting architectural parallelism in distributed training.","short_abstract":"Training Mixture-of-Experts (MoE) models introduces sparse and highly imbalanced all-to-all communication that dominates iteration time. Conventional load-balancing methods fail to exploit the deterministic topology of Rail architectures, leaving multi-NIC bandwidth underutilized. We present RailS, a distributed load-b...","url_abs":"https://arxiv.org/abs/2510.19262","url_pdf":"https://arxiv.org/pdf/2510.19262v2","authors":"[\"Heng Xu\",\"Zhiwei Yu\",\"Chengze Du\",\"Ying Zhou\",\"Letian Li\",\"Haojie Wang\",\"Weiqiang Cheng\",\"Jialong Li\"]","published":"2025-10-22T05:43:13Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.NI\"]","methods":"[]","has_code":false}
