{"ID":2890286,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18889","arxiv_id":"2507.18889","title":"RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems","abstract":"Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architecture is neither scalable nor cost-effective enough. Tree-based topologies such as the \\textit{Rail-optimized} network are extremely expensive, while direct topologies such as \\textit{Torus} have insufficient bisection bandwidth and flexibility. In this paper, we propose \\textit{RailX}, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically 2D-organized, achieving better scalability than existing centralized circuit switching networks. We propose a novel interconnection method based on \\textit{Hamiltonian Decomposition} theory to organize separate rail-based rings into \\textit{all-to-all} topology, simultaneously optimizing ring-collective and all-to-all communication. More than $100$K chips with hyper bandwidth can be interconnected with a flat switching layer, and the diameter is only $2\\sim4$ inter-node hops. The network cost per injection/All-Reduce bandwidth of \\textit{RailX} is less than $10\\%$ of the Fat-Tree, and the cost per bisection/All-to-All bandwidth is less than $50\\%$ of the Fat-Tree. Specifically, only $\\sim$\\$$1.3$B is required to interconnect 200K chips with 1.8TB bandwidth. \\textit{RailX} can also be used in the ML-as-a-service (MLaaS) scenario, where single or multiple training workloads with various shapes, scales, and parallelism strategies can be flexibly mapped, and failures can be worked around.","short_abstract":"Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architecture is neither scalable nor cost-effective enough. Tree-based topologies such as the \\textit{Rail-optimized} network are extremely expensive, while direct topologies such as \\textit{Torus} h...","url_abs":"https://arxiv.org/abs/2507.18889","url_pdf":"https://arxiv.org/pdf/2507.18889v1","authors":"[\"Yinxiao Feng\",\"Tiancheng Chen\",\"Yuchen Wei\",\"Siyuan Shen\",\"Shiju Wang\",\"Wei Li\",\"Kaisheng Ma\",\"Torsten Hoefler\"]","published":"2025-07-25T02:16:08Z","proceeding":"cs.AR","tasks":"[\"cs.AR\",\"cs.DC\",\"cs.NI\"]","methods":"[\"Large Language Model\",\"Generative Adversarial Network\"]","has_code":false}
