{"ID":2852770,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.17519","arxiv_id":"2510.17519","title":"MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models","abstract":"In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling. Details are available at https://github.com/Shopee-MUG/MUG-V.","short_abstract":"In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. 
However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequence...","url_abs":"https://arxiv.org/abs/2510.17519","url_pdf":"https://arxiv.org/pdf/2510.17519v2","authors":"[\"Yongshun Zhang\",\"Zhongyi Fan\",\"Yonghang Zhang\",\"Zhangzikang Li\",\"Weifeng Chen\",\"Zhongwei Feng\",\"Chaoyue Wang\",\"Peng Hou\",\"Anxiang Zeng\"]","published":"2025-10-20T13:20:37Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[]","has_code":false,"code_links":[{"ID":608027,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2852770,"paper_url":"https://arxiv.org/abs/2510.17519","paper_title":"MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models","repo_url":"https://github.com/Shopee-MUG/MUG-V","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}