{"ID":2887306,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.01723","arxiv_id":"2508.01723","title":"OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping","abstract":"Grounding natural language instructions to visual observations is fundamental for embodied agents operating in open-world environments. Recent advances in visual-language mapping have enabled generalizable semantic representations by leveraging vision-language models (VLMs). However, these methods often fall short in aligning free-form language commands with specific scene instances, due to limitations in both instance-level semantic consistency and instruction interpretation. We present OpenMap, a zero-shot open-vocabulary visual-language map designed for accurate instruction grounding in navigation tasks. To address semantic inconsistencies across views, we introduce a Structural-Semantic Consensus constraint that jointly considers global geometric structure and vision-language similarity to guide robust 3D instance-level aggregation. To improve instruction interpretation, we propose an LLM-assisted Instruction-to-Instance Grounding module that enables fine-grained instance selection by incorporating spatial context and expressive target descriptions. We evaluate OpenMap on ScanNet200 and Matterport3D, covering both semantic mapping and instruction-to-target retrieval tasks. Experimental results show that OpenMap outperforms state-of-the-art baselines in zero-shot settings, demonstrating the effectiveness of our method in bridging free-form language and 3D perception for embodied navigation.","short_abstract":"Grounding natural language instructions to visual observations is fundamental for embodied agents operating in open-world environments. Recent advances in visual-language mapping have enabled generalizable semantic representations by leveraging vision-language models (VLMs). However, these methods often fall short in a...","url_abs":"https://arxiv.org/abs/2508.01723","url_pdf":"https://arxiv.org/pdf/2508.01723v1","authors":"[\"Danyang Li\",\"Zenghui Yang\",\"Guangpeng Qi\",\"Songtao Pang\",\"Guangyong Shang\",\"Qiang Ma\",\"Zheng Yang\"]","published":"2025-08-03T11:25:52Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
