{"ID":2828811,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13030","arxiv_id":"2512.13030","title":"Motus: A Unified Latent Action World Model","abstract":"While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level \"delta action\" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improved by +11~48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.","short_abstract":"While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. \nIn this paper, we propose Motu...","url_abs":"https://arxiv.org/abs/2512.13030","url_pdf":"https://arxiv.org/pdf/2512.13030v2","authors":"[\"Hongzhe Bi\",\"Hengkai Tan\",\"Shenghao Xie\",\"Zeyuan Wang\",\"Shuhe Huang\",\"Haitian Liu\",\"Ruowen Zhao\",\"Yao Feng\",\"Chendong Xiang\",\"Yinze Rong\",\"Hongyan Zhao\",\"Hanyu Liu\",\"Zhizhong Su\",\"Lei Ma\",\"Hang Su\",\"Jun Zhu\"]","published":"2025-12-15T06:58:40Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\",\"cs.RO\"]","methods":"[\"Transformer\"]","has_code":false}
