{"ID":2887946,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.00579","arxiv_id":"2508.00579","title":"MHier-RAG: Multi-Modal RAG for Visual-Rich Document Question-Answering via Hierarchical and Multi-Granularity Reasoning","abstract":"The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidences (such as texts, tables, charts, images, and layouts) distributed across multiple pages, for question understanding and answer generation. The existing methods can be categorized into Large Vision-Language Model (LVLM)-based and Retrieval-Augmented Generation (RAG)-based methods. However, the former were susceptible to hallucinations, while the latter struggled for inter-modal disconnection and cross-page fragmentation. To address these challenges, a novel multi-modal RAG model, named MHier-RAG, was proposed, leveraging both textual and visual information across long-range pages to facilitate accurate question answering for visual-rich documents. A hierarchical indexing method with the integration of flattened in-page chunks and topological cross-page chunks was designed to jointly establish in-page multi-modal associations and long-distance cross-page dependencies. By means of joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method, including the page-level parent page retrieval and document-level summary retrieval, was proposed to foster multi-modal evidence connection and long-distance evidence integration and reasoning. Experimental results performed on public datasets, MMLongBench-Doc and LongDocURL, demonstrated the superiority of our MHier-RAG method in understanding and answering modality-rich and multi-page documents.","short_abstract":"The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidences (such as texts, tables, charts, images, and layouts) distributed across multiple pages, for question understanding and answer generation. The existing methods can be categorized into Large Vision-Language Mo...","url_abs":"https://arxiv.org/abs/2508.00579","url_pdf":"https://arxiv.org/pdf/2508.00579v3","authors":"[\"Ziyu Gong\",\"Chengcheng Mai\",\"Yihua Huang\"]","published":"2025-08-01T12:22:53Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.IR\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\"]","has_code":false}
