{"ID":2861933,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00606","arxiv_id":"2510.00606","title":"ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training","abstract":"Large-scale LLM pretraining now runs across $10^5$--$10^6$ accelerators, making failures routine and elasticity mandatory. We posit that an elastic-native training system must jointly deliver (i) parameter consistency, (ii) low mean time to recovery (MTTR), (iii) high post-change throughput, and (iv) computation consistency. No prior system achieves all four simultaneously. To achieve these goals, we present ElasWave, which delivers per-step fault tolerance via multi-dimensional scheduling across graph, dataflow, DVFS, and RNG. ElasWave reshapes and reshards micro-batches while preserving the global batch size and gradient scale. It performs online pipeline resharding with asynchronous parameter migration and interleaves ZeRO partitions, reducing parameter recovery processes to disjoint rank-to-rank transfers. It further leverages DVFS to absorb pipeline bubbles and reshards RNG to keep computation consistency. Together, a dynamic communicator enables in-place communication group edits, while per-step in-memory snapshots support online verification and redistribution. We evaluate ElasWave on 96 NPUs and benchmark it against state-of-the-art baselines: throughput improves by $1.35\\times$ over ReCycle and $1.60\\times$ over TorchFT; communicator recovery completes within one second (up to $82\\times/3.6\\times$ faster than full/partial rebuilds); migration MTTR drops by as much as $51\\%$; and convergence deviation is reduced by approximately $78\\%$.","short_abstract":"Large-scale LLM pretraining now runs across $10^5$--$10^6$ accelerators, making failures routine and elasticity mandatory. We posit that an elastic-native training system must jointly deliver (i) parameter consistency, (ii) low mean time to recovery (MTTR), (iii) high post-change throughput, and (iv) computation consis...","url_abs":"https://arxiv.org/abs/2510.00606","url_pdf":"https://arxiv.org/pdf/2510.00606v3","authors":"[\"Xueze Kang\",\"Guangyu Xiang\",\"Yuxin Wang\",\"Hao Zhang\",\"Yuchu Fang\",\"Yuhang Zhou\",\"Zhenheng Tang\",\"Youhui Lv\",\"Eliran Maman\",\"Mark Wasserman\",\"Alon Zameret\",\"Zhipeng Bian\",\"Shushu Chen\",\"Zhiyou Yu\",\"Jin Wang\",\"Xiaoyu Wu\",\"Yang Zheng\",\"Chen Tian\",\"Xiaowen Chu\"]","published":"2025-10-01T07:34:39Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Large Language Model\"]","has_code":false}
