{"ID":2863008,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.26642","arxiv_id":"2509.26642","title":"MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation","abstract":"Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA's understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations.","short_abstract":"Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the s...","url_abs":"https://arxiv.org/abs/2509.26642","url_pdf":"https://arxiv.org/pdf/2509.26642v2","authors":"[\"Zhuoyang Liu\",\"Jiaming Liu\",\"Jiadong Xu\",\"Nuowei Han\",\"Chenyang Gu\",\"Hao Chen\",\"Kaichen Zhou\",\"Renrui Zhang\",\"Kai Chin Hsieh\",\"Kun Wu\",\"Zhengping Che\",\"Jian Tang\",\"Shanghang Zhang\"]","published":"2025-09-30T17:59:50Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Language Model\"]","has_code":false}