{"ID":2834946,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.00975","arxiv_id":"2512.00975","title":"MM-ACT: Learn from Multimodal Parallel Generation to Act","abstract":"A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our codes, models and data at https://github.com/HHYHRHY/MM-ACT.","short_abstract":"A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in shared token space and performs gen...","url_abs":"https://arxiv.org/abs/2512.00975","url_pdf":"https://arxiv.org/pdf/2512.00975v2","authors":"[\"Haotian Liang\",\"Xinyi Chen\",\"Bin Wang\",\"Mingkang Chen\",\"Yitian Liu\",\"Yuhao Zhang\",\"Zanxin Chen\",\"Tianshuo Yang\",\"Yilun Chen\",\"Jiangmiao Pang\",\"Dong Liu\",\"Xiaokang Yang\",\"Yao Mu\",\"Wenqi Shao\",\"Ping Luo\"]","published":"2025-11-30T16:46:35Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\",\"cs.RO\"]","methods":"[]","has_code":false,"code_links":[{"ID":606463,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2834946,"paper_url":"https://arxiv.org/abs/2512.00975","paper_title":"MM-ACT: Learn from Multimodal Parallel Generation to Act","repo_url":"https://github.com/HHYHRHY/MM-ACT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}