{"ID":2840878,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12405","arxiv_id":"2511.12405","title":"VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving","abstract":"Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.","short_abstract":"Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present...","url_abs":"https://arxiv.org/abs/2511.12405","url_pdf":"https://arxiv.org/pdf/2511.12405v1","authors":"[\"Hyunki Seong\",\"Seongwoo Moon\",\"Hojin Ahn\",\"Jehun Kang\",\"David Hyunchul Shim\"]","published":"2025-11-16T00:55:28Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\",\"LoRA\"]","has_code":false}
