{"ID":2852397,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21817","arxiv_id":"2510.21817","title":"VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting","abstract":"Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture where two parallel VLA instances operate as an ``Active Model'' and a ``Standby Model'', allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a ``model-as-controller'' paradigm, where we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.","short_abstract":"Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsi...","url_abs":"https://arxiv.org/abs/2510.21817","url_pdf":"https://arxiv.org/pdf/2510.21817v1","authors":"[\"Xiaoyu Liu\",\"Chaoyou Fu\",\"Chi Yan\",\"Chu Wu\",\"Haihan Gao\",\"Yi-Fan Zhang\",\"Shaoqi Dong\",\"Cheng Qian\",\"Bin Luo\",\"Xiuyong Yang\",\"Guanwu Li\",\"Yusheng Cai\",\"Yunhang Shen\",\"Deqiang Jiang\",\"Haoyu Cao\",\"Xing Sun\",\"Caifeng Shan\",\"Ran He\"]","published":"2025-10-21T17:59:56Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CL\",\"cs.LG\"]","methods":"[]","has_code":false}
