{"ID":2897657,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.05240","arxiv_id":"2507.05240","title":"StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling","abstract":"Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.","short_abstract":"Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face tr...","url_abs":"https://arxiv.org/abs/2507.05240","url_pdf":"https://arxiv.org/pdf/2507.05240v1","authors":"[\"Meng Wei\",\"Chenyang Wan\",\"Xiqian Yu\",\"Tai Wang\",\"Yuqiang Yang\",\"Xiaohan Mao\",\"Chenming Zhu\",\"Wenzhe Cai\",\"Hanqing Wang\",\"Yilun Chen\",\"Xihui Liu\",\"Jiangmiao Pang\"]","published":"2025-07-07T17:49:41Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
