{"ID":3005022,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T06:46:15.197025399Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03240","arxiv_id":"2606.03240","title":"GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models","abstract":"Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy rollout. The robot's proprioceptive state queries the GEP feature grid, producing compact, phase-dependent geometry tokens for action prediction. GeoAlign achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, with ablations confirming the value of geometry post-training and proprioceptive-state-guided querying.","short_abstract":"Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geome...","url_abs":"https://arxiv.org/abs/2606.03240","url_pdf":"https://arxiv.org/pdf/2606.03240v1","authors":"[\"Yizhi Chen\",\"Zhanxiang Cao\",\"Xinyi Peng\",\"Yixiao Zheng\",\"Xiaxi Si\",\"Yiheng Li\",\"Liyun Yan\",\"Keqi Zhu\",\"Xueyun Chen\",\"Shengcheng Fu\",\"Tianyue Zhan\",\"Yufei Jia\",\"Jinming Yao\",\"Yan Xie\",\"Kun Wang\",\"Cewu Lu\",\"Yue Gao\"]","published":"2026-06-02T07:01:18Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[]","has_code":false}