{"ID":2852412,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.19001","arxiv_id":"2510.19001","title":"Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts","abstract":"We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.","short_abstract":"We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exempla...","url_abs":"https://arxiv.org/abs/2510.19001","url_pdf":"https://arxiv.org/pdf/2510.19001v1","authors":"[\"Seungjun Yu\",\"Junsung Park\",\"Youngsun Lim\",\"Hyunjung Shim\"]","published":"2025-10-21T18:24:59Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.RO\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
