{"ID":2826028,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.18933","arxiv_id":"2512.18933","title":"Point What You Mean: Visually Grounded Instruction Policy","abstract":"Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompt, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce the Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.","short_abstract":"Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompt, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce the Point-VLA, a plug-and-play policy that augments languag...","url_abs":"https://arxiv.org/abs/2512.18933","url_pdf":"https://arxiv.org/pdf/2512.18933v2","authors":"[\"Hang Yu\",\"Juntu Zhao\",\"Yufeng Liu\",\"Kaiyu Li\",\"Cheng Ma\",\"Di Zhang\",\"Yingdong Hu\",\"Guang Chen\",\"Junyuan Xie\",\"Junliang Guo\",\"Junqiao Zhao\",\"Yang Gao\"]","published":"2025-12-22T00:44:19Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.RO\"]","methods":"[]","has_code":false}