{"ID":2854087,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.15596","arxiv_id":"2510.15596","title":"PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training","abstract":"Large model training beyond tens of thousands of GPUs is an uncharted territory. At such scales, disruptions to the training process are not a matter of if, but a matter of when -- a stochastic process degrading training productivity. Dynamic runtime variation will become increasingly more frequent as training scales up and as GPUs are operated in increasingly power-limited and thermally-stressed environments. At the 64,000+ GPU scale, we already observe 9% GPU time variability for frontier foundation model training. Motivated by our analysis and the large design space around performance variability, we present PRISM -- a performance modeling framework that captures the stochastic nature of large-scale distributed training. The core of PRISM is a statistical method that quantifies probabilistic guarantees on training time. Using PRISM, we explore the design and optimization space of distributed training, enabling principled, variability-aware decisions that improve performance and system efficiency at scale.","short_abstract":"Large model training beyond tens of thousands of GPUs is an uncharted territory. At such scales, disruptions to the training process are not a matter of if, but a matter of when -- a stochastic process degrading training productivity. Dynamic runtime variation will become increasingly more frequent as training scales u...","url_abs":"https://arxiv.org/abs/2510.15596","url_pdf":"https://arxiv.org/pdf/2510.15596v2","authors":"[\"Alicia Golden\",\"Michael Kuchnik\",\"Samuel Hsia\",\"Zachary DeVito\",\"Gu-Yeon Wei\",\"David Brooks\",\"Carole-Jean Wu\"]","published":"2025-10-17T12:41:37Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[]","has_code":false}