{"ID":2879321,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.16148","arxiv_id":"2508.16148","title":"Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering","abstract":"Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models suffer from a strong bias toward English training data, resulting in suboptimal performance for Japanese and other language scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines multimodal hierarchical reasoning mechanisms with Colqwen-optimized retrieval methods, while innovatively introducing a semantic verification strategy through sub-question decomposition. Experimental results demonstrate that our framework not only significantly enhances the model's deep semantic parsing capability for complex documents, but also exhibits superior robustness in practical application scenarios.","short_abstract":"Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. \nHowever, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations...","url_abs":"https://arxiv.org/abs/2508.16148","url_pdf":"https://arxiv.org/pdf/2508.16148v1","authors":"[\"Ao Zhou\",\"Zebo Gu\",\"Tenghao Sun\",\"Jiawen Chen\",\"Mingsheng Tu\",\"Zifeng Cheng\",\"Yafeng Yin\",\"Zhiwei Jiang\",\"Qing Gu\"]","published":"2025-08-22T07:17:16Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.CL\",\"cs.MM\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}