{"ID":3083732,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T05:32:54.120957816Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.06155","arxiv_id":"2606.06155","title":"AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding","abstract":"Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception--action mappings. To address this challenge, we propose \\textbf{AffordanceVLA}, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception--action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) \\textbf{Which2Act} for object-centric grounding via visual latent prediction to suppress distractions; 2) \\textbf{Where2Act} for 2D interaction localization via affordance map estimation; and 3) \\textbf{How2Act} for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. 
Extensive experiments in simulation and in the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.","short_abstract":"Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception--action mapp...","url_abs":"https://arxiv.org/abs/2606.06155","url_pdf":"https://arxiv.org/pdf/2606.06155v1","authors":"[\"Qize Yu\",\"Jiadi You\",\"Yuran Wang\",\"Jiaqi Liang\",\"Bowen Ping\",\"Yang Tian\",\"Yue Chen\",\"Minghong Cai\",\"Zeying Gong\",\"Ruihai Wu\",\"Yinchuan Li\",\"Junwei Liang\",\"Yingcong Chen\"]","published":"2026-06-04T13:28:51Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CV\",\"cs.MM\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
