{"ID":2881295,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.13337","arxiv_id":"2508.13337","title":"X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms","abstract":"Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems - primarily optimized for NVIDIA GPUs - perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped. In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1024 GPUs - 10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput. The source code of X-MoE is available at https://github.com/Supercomputing-System-AI-Lab/X-MoE.","short_abstract":"Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. \nFurthermore, curre...","url_abs":"https://arxiv.org/abs/2508.13337","url_pdf":"https://arxiv.org/pdf/2508.13337v1","authors":"[\"Yueming Yuan\",\"Ahan Gupta\",\"Jianping Li\",\"Sajal Dash\",\"Feiyi Wang\",\"Minjia Zhang\"]","published":"2025-08-18T19:49:28Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\",\"cs.DC\"]","methods":"[]","has_code":false,"code_links":[{"ID":610800,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2881295,"paper_url":"https://arxiv.org/abs/2508.13337","paper_title":"X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms","repo_url":"https://github.com/Supercomputing-System-AI-Lab/X-MoE","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}