Optimizing Frequent Checkpointing via Low-Cost Differential for Distributed Training Systems

Sep 4, 2025 cs.DC arXiv:2509.04084

Abstract

Distributed training of large deep-learning models is prone to failures, so checkpointing is commonly used for recovery. State-of-the-art approaches checkpoint frequently so that training can recover quickly after a failure. However, frequent checkpointing generates numerous checkpoints, incurring substantial costs that degrade training performance. Differential checkpointing has recently been proposed to reduce these costs, but it has been limited to recommendation systems, and its application to general distributed training systems remains unexplored. We propose \sysname, an efficient frequent checkpointing framework that reuses compressed gradients as differential checkpoints to reduce cost. Furthermore, \sysname incorporates a batched gradient write optimization to persist these differentials to storage efficiently. It also dynamically tunes both the checkpoint frequency and the batching size to maximize performance. To enhance \sysname in non-compression scenarios, we further propose \sysnameplus, which incorporates a layer-wise-reuse snapshotting strategy along with an incremental-merging persistence strategy. Experiments on various workloads show that \sysname and \sysnameplus can reduce training time by up to 89.2% and 81.2%, respectively, with checkpointing frequencies as high as once per iteration.
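The idea of reusing compressed gradients as differential checkpoints, with batched writes, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes plain SGD with a fixed learning rate and top-k sparsification as the compression scheme, and it uses an in-memory list as a stand-in for storage. The names `top_k_sparsify`, `apply_update`, and `DiffCheckpointer` are hypothetical.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude gradient entries as {index: value}."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in order[:k]}


def apply_update(params, sparse_grad, lr):
    """Plain SGD step using only the entries kept by compression."""
    for i, g in sparse_grad.items():
        params[i] -= lr * g


class DiffCheckpointer:
    """Hypothetical helper: one full base checkpoint plus batched differentials."""

    def __init__(self, base_params, batch_size):
        self.base = list(base_params)   # full checkpoint taken once
        self.batch_size = batch_size    # differentials per storage write
        self.buffer = []                # differentials not yet persisted
        self.persisted = []             # stand-in for durable storage

    def record(self, sparse_grad):
        # Reuse the already-compressed gradient as the differential checkpoint.
        self.buffer.append(dict(sparse_grad))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # One write covers a whole batch of differentials.
        if self.buffer:
            self.persisted.append(list(self.buffer))
            self.buffer.clear()

    def recover(self, lr):
        # Replay the persisted differentials on top of the base checkpoint.
        params = list(self.base)
        for batch in self.persisted:
            for sparse_grad in batch:
                apply_update(params, sparse_grad, lr)
        return params


# Toy run: loss = 0.5 * ||w||^2, so the gradient equals the parameters.
lr = 0.1
params = [1.0, -2.0, 0.5, 3.0]
ckpt = DiffCheckpointer(params, batch_size=2)
for _ in range(4):
    sparse_grad = top_k_sparsify(list(params), k=2)
    apply_update(params, sparse_grad, lr)
    ckpt.record(sparse_grad)

recovered = ckpt.recover(lr)
```

In this toy setting, replaying the persisted differentials reproduces the live parameters exactly, because recovery applies the same updates in the same order. Differentials still sitting in the buffer at failure time would be lost, which is the trade-off a batching size has to balance against write cost. Optimizers that carry extra state, such as momentum, would need that state checkpointed as well.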
