{"ID":2887339,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.01766","arxiv_id":"2508.01766","title":"VPN: Visual Prompt Navigation","abstract":"While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is more friendly for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.","short_abstract":"While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. \nTo this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user...","url_abs":"https://arxiv.org/abs/2508.01766","url_pdf":"https://arxiv.org/pdf/2508.01766v6","authors":"[\"Shuo Feng\",\"Zihan Wang\",\"Yuchen Li\",\"Rui Kong\",\"Hengyi Cai\",\"Shuaiqiang Wang\",\"Gim Hee Lee\",\"Piji Li\",\"Shuqiang Jiang\"]","published":"2025-08-03T14:07:45Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":611424,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2887339,"paper_url":"https://arxiv.org/abs/2508.01766","paper_title":"VPN: Visual Prompt Navigation","repo_url":"https://github.com/farlit/VPN","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
