{"ID":2829033,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13488","arxiv_id":"2512.13488","title":"SIGMA: An AI-Empowered Training Stack on Early-Life Hardware","abstract":"An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization combined with unpredictable local noise that degrades efficiency. To address these challenges, SIGMA is an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. The core of this initiative is the LUCIA TRAINING PLATFORM (LTP), the system optimized for clusters with early-life AI accelerators. Since its launch in March 2025, LTP has significantly enhanced training reliability and operational productivity. Over the past five months, it has achieved an impressive 94.45% effective cluster accelerator utilization, while also substantially reducing node recycling and job-recovery times. Building on the foundation of LTP, the LUCIA TRAINING FRAMEWORK (LTF) successfully trained SIGMA-MOE, a 200B MoE model, using 2,048 AI accelerators. This effort delivered remarkable stability and efficiency outcomes, achieving 21.08% MFU, state-of-the-art downstream accuracy, and encountering only one stability incident over a 75-day period. Together, these advances establish SIGMA, which not only tackles the critical challenges of large-scale training but also establishes a new benchmark for AI infrastructure and platform innovation, offering a robust, cost-effective alternative to prevailing established accelerator stacks and significantly advancing AI capabilities and scalability. The source code of SIGMA is available at https://github.com/microsoft/LuciaTrainingPlatform.","short_abstract":"An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that th...","url_abs":"https://arxiv.org/abs/2512.13488","url_pdf":"https://arxiv.org/pdf/2512.13488v1","authors":"[\"Lei Qu\",\"Lianhai Ren\",\"Peng Cheng\",\"Rui Gao\",\"Ruizhe Wang\",\"Tianyu Chen\",\"Xiao Liu\",\"Xingjian Zhang\",\"Yeyun Gong\",\"Yifan Xiong\",\"Yucheng Ding\",\"Yuting Jiang\",\"Zhenghao Lin\",\"Zhongxin Guo\",\"Ziyue Yang\"]","published":"2025-12-15T16:24:32Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.CL\"]","methods":"[]","has_code":false,"code_links":[{"ID":605923,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2829033,"paper_url":"https://arxiv.org/abs/2512.13488","paper_title":"SIGMA: An AI-Empowered Training Stack on Early-Life Hardware","repo_url":"https://github.com/microsoft/LuciaTrainingPlatform","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}