{"ID":2876533,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.00576","arxiv_id":"2509.00576","title":"Galaxea Open-World Dataset and G0 Dual-System VLA Model","abstract":"We present Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation, demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.","short_abstract":"We present Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluatio...","url_abs":"https://arxiv.org/abs/2509.00576","url_pdf":"https://arxiv.org/pdf/2509.00576v1","authors":"[\"Tao Jiang\",\"Tianyuan Yuan\",\"Yicheng Liu\",\"Chenhao Lu\",\"Jianning Cui\",\"Xiao Liu\",\"Shuiqi Cheng\",\"Jiyang Gao\",\"Huazhe Xu\",\"Hang Zhao\"]","published":"2025-08-30T18:04:19Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}