{"ID":3004892,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T11:10:57.854545281Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03498","arxiv_id":"2606.03498","title":"Demystifying Pipeline Parallelism: First Theory for PipeDream","abstract":"Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018). Our first contribution is theoretical: we introduce Randomized PipeDream (RPD), a stale block-SGD abstraction that yields, to our knowledge, the first clean nonconvex convergence guarantee for a PD-style method. Our second contribution is a scaling diagnosis: we prove that the delay induced by steady-state PD grows as $S^2 - S/2 + O(1)$ for $S$ stages, so the stale-read contribution in the convergence theorem scales as $Θ(γ^2 S^4)$, equivalently as $Θ(S^4/K)$ in the tuned-rate form. Our third contribution is a comparison with LocalSGD, whose periodic model averaging trades weight staleness for synchronization bubbles. In our reported simulated-time experiments, the better-performing method depends on the objective: PD performs better on the quadratic objective and on a small language-modeling training-loss task, while for logistic regression LocalSGD becomes superior as the number of stages increases.","short_abstract":"Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit o...","url_abs":"https://arxiv.org/abs/2606.03498","url_pdf":"https://arxiv.org/pdf/2606.03498v1","authors":"[\"Ivan Ilin\",\"Peter Richtárik\"]","published":"2026-06-02T11:14:57Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.DC\"]","methods":"[]","has_code":false}