{"ID":2891434,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.17659","arxiv_id":"2507.17659","title":"See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering","abstract":"Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This \"seeing only the trees, but not the forest\" approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and trees, we propose Synergos-VQA, a novel synergistic reasoning framework. At its core, Synergos-VQA concurrently generates and fuses three complementary evidence streams at inference time: (1) Holistic Evidence to perceive the entire scene (the \"forest\"), (2) Structural Evidence from a prototype-driven module to identify key objects (the \"trees\"), and (3) Causal Evidence from a counterfactual probe to ensure the reasoning is robustly grounded. By synergistically fusing this multi-faceted evidence, our framework achieves a more comprehensive and reliable reasoning process. Extensive experiments show that Synergos-VQA decisively establishes a new state-of-the-art on three challenging benchmarks, including OK-VQA and A-OKVQA. Furthermore, our approach demonstrates strong plug-and-play capabilities, significantly boosting various open-source MLLMs and proving that superior methodological design can outperform sheer model scale.","short_abstract":"Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This \"seeing only the trees, but not the forest\" approach prevents robust, multi-faceted understanding. In...","url_abs":"https://arxiv.org/abs/2507.17659","url_pdf":"https://arxiv.org/pdf/2507.17659v3","authors":"[\"Junjie Wang\",\"Yunhan Tang\",\"Yijie Wang\",\"Zhihao Yuan\",\"Huan Wang\",\"Yangfan He\",\"Bin Li\"]","published":"2025-07-23T16:24:57Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
