{"ID":2890116,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19845","arxiv_id":"2507.19845","title":"MegatronApp: Efficient and Comprehensive Management on Distributed LLM Training","abstract":"The rapid escalation in the parameter count of large language models (LLMs) has transformed model training from a single-node endeavor into a highly intricate, cross-node activity. While frameworks such as Megatron-LM successfully integrate tensor (TP), pipeline (PP), and data (DP) parallelism to enable trillion-parameter training, they simultaneously expose practitioners to unprecedented systems-level challenges in performance optimization, diagnosis, and interpretability. MegatronApp is an open-source toolchain expressly designed to meet these challenges. It introduces four orthogonal, yet seamlessly composable modules--MegaScan, MegaFBD, MegaDPP, and MegaScope--that collectively elevate the reliability, efficiency, and transparency of production-scale training. This paper presents the motivation, architecture, and distinctive contributions of each module, and elucidates how their synergistic integration augments the Megatron-LM ecosystem.","short_abstract":"The rapid escalation in the parameter count of large language models (LLMs) has transformed model training from a single-node endeavor into a highly intricate, cross-node activity. While frameworks such as Megatron-LM successfully integrate tensor (TP), pipeline (PP), and data (DP) parallelism to enable trillion-parame...","url_abs":"https://arxiv.org/abs/2507.19845","url_pdf":"https://arxiv.org/pdf/2507.19845v1","authors":"[\"Bohan Zhao\",\"Guang Yang\",\"Shuo Chen\",\"Ruitao Liu\",\"Tingrui Zhang\",\"Yongchao He\",\"Wei Xu\"]","published":"2025-07-26T07:39:30Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}