{"ID":2877302,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.21112","arxiv_id":"2508.21112","title":"EO-1: An Open Unified Embodied Foundation Model for General Robot Control","abstract":"The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models. 
Project Page: https://eo-robotics.ai/eo-1.","short_abstract":"The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in gener...","url_abs":"https://arxiv.org/abs/2508.21112","url_pdf":"https://arxiv.org/pdf/2508.21112v5","authors":"[\"Delin Qu\",\"Haoming Song\",\"Qizhi Chen\",\"Zhaoqing Chen\",\"Xianqiang Gao\",\"Dong Wang\",\"Xinyi Ye\",\"Qi Lv\",\"Modi Shi\",\"Guanghui Ren\",\"Cheng Ruan\",\"Maoqing Yao\",\"Haoran Yang\",\"Jiacheng Bao\",\"Bin Zhao\",\"Xuelong Li\"]","published":"2025-08-28T17:26:15Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\"]","methods":"[]","project_urls":"[\"https://eo-robotics.ai/eo-1\"]","has_code":false}