{"ID":2872161,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.09332","arxiv_id":"2509.09332","title":"OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning","abstract":"Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA -- an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits a strong ability across a wide range of downstream scenarios. Evaluations of a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io","short_abstract":"Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geome...","url_abs":"https://arxiv.org/abs/2509.09332","url_pdf":"https://arxiv.org/pdf/2509.09332v3","authors":"[\"Yuecheng Liu\",\"Dafeng Chi\",\"Shiguang Wu\",\"Zhanguang Zhang\",\"Yuzheng Zhuang\",\"Bowen Yang\",\"He Zhu\",\"Lingfeng Zhang\",\"Pengwei Xie\",\"David Gamaliel Arcos Bravo\",\"Yingxue Zhang\",\"Jianye Hao\",\"Xingyue Quan\"]","published":"2025-09-11T10:32:22Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\",\"cs.CL\",\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
