{"ID":2858562,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.06638","arxiv_id":"2510.06638","title":"StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering","abstract":"Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. Recent work has introduced its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source and answers are produced without external retrieval. Existing IK-KVQA approaches, however, are typically trained with answer-only supervision: reasoning remains implicit, justifications are often weak or inconsistent, and generalization after standard supervised fine-tuning (SFT) can be brittle. We propose StaR-KVQA, a framework that equips IK-KVQA with dual-path structured reasoning traces - symbolic relation paths over text and vision together with path-grounded natural-language explanations - to provide a stronger inductive bias than generic answer-only supervision. These traces act as modality-aware scaffolds that guide the model toward relevant entities and attributes, offering more structure than generic chain-of-thought supervision while not constraining reasoning to any single fixed path. With a single open-source MLLM, StaR-KVQA constructs and selects traces to build an offline trace-enriched dataset and then performs structure-aware self-distillation; no external retrievers, verifiers, or curated knowledge bases are used, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA consistently improves both answer accuracy and the transparency of intermediate reasoning, achieving up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline.","short_abstract":"Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. Recent work has introduced its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source and answers are produced without external retri...","url_abs":"https://arxiv.org/abs/2510.06638","url_pdf":"https://arxiv.org/pdf/2510.06638v3","authors":"[\"Zhihao Wen\",\"Wenkang Wei\",\"Yuan Fang\",\"Xingtong Yu\",\"Hui Zhang\",\"Weicheng Zhu\",\"Xin Zhang\"]","published":"2025-10-08T04:37:53Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
