{"ID":2876760,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.21613","arxiv_id":"2508.21613","title":"Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection","abstract":"Training large language models faces frequent interruptions due to various faults, demanding robust fault-tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-recovery inefficiencies. We propose Chameleon, an adaptive fault-tolerant system that intelligently selects optimal recovery strategies when a failure occurs. Chameleon achieves this through a unified performance model, expedient execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Chameleon maintains a performance gap of within 11.00% between post-recovery and failure-free training, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Chameleon achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.","short_abstract":"Training large language models faces frequent interruptions due to various faults, demanding robust fault-tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-re...","url_abs":"https://arxiv.org/abs/2508.21613","url_pdf":"https://arxiv.org/pdf/2508.21613v4","authors":"[\"Yuhang Zhou\",\"Zhibin Wang\",\"Peng Jiang\",\"Haoran Xia\",\"Junhe Lu\",\"Qianyu Jiang\",\"Rong Gu\",\"Hengxi Xu\",\"Xinjing Huang\",\"Guanghuan Fang\",\"Zhiheng Hu\",\"Jingyi Zhang\",\"Yongjin Cai\",\"Jian He\",\"Chen Tian\"]","published":"2025-08-29T13:22:11Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Language Model\"]","has_code":false}
