{"ID":2853376,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.16415","arxiv_id":"2510.16415","title":"MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization","abstract":"As distributed optimization scales to meet the demands of Large Language Model (LLM) training, hardware failures become increasingly non-negligible. Existing fault-tolerant training methods often introduce significant computational or memory overhead, demanding additional resources. To address this challenge, we propose Memory- and Computation-efficient Fault-tolerant Optimization (MeCeFO), a novel algorithm that ensures robust training with minimal overhead. When a computing node fails, MeCeFO seamlessly transfers its training task to a neighboring node while employing memory- and computation-efficient algorithmic optimizations to minimize the extra workload imposed on the neighboring node handling both tasks. MeCeFO leverages three key algorithmic designs: (i) Skip-connection, which drops the multi-head attention (MHA) module during backpropagation for memory- and computation-efficient approximation; (ii) Recomputation, which reduces activation memory in feedforward networks (FFNs); and (iii) Low-rank gradient approximation, enabling efficient estimation of FFN weight matrix gradients. Theoretically, MeCeFO matches the convergence rate of conventional distributed training, with a rate of $\\mathcal{O}(1/\\sqrt{nT})$, where n is the data parallelism size and T is the number of iterations. Empirically, MeCeFO maintains robust performance under high failure rates, incurring only a 4.18% drop in throughput, demonstrating 5.0$\\times$ to 6.7$\\times$ greater resilience than previous SOTA approaches. 
Code is available at https://github.com/pkumelon/MeCeFO.","short_abstract":"As distributed optimization scales to meet the demands of Large Language Model (LLM) training, hardware failures become increasingly non-negligible. Existing fault-tolerant training methods often introduce significant computational or memory overhead, demanding additional resources. To address this challenge, we propos...","url_abs":"https://arxiv.org/abs/2510.16415","url_pdf":"https://arxiv.org/pdf/2510.16415v1","authors":"[\"Rizhen Hu\",\"Yutong He\",\"Ran Yan\",\"Mou Sun\",\"Binghang Yuan\",\"Kun Yuan\"]","published":"2025-10-18T09:15:57Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608076,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2853376,"paper_url":"https://arxiv.org/abs/2510.16415","paper_title":"MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization","repo_url":"https://github.com/pkumelon/MeCeFO","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}