{"ID":2851329,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.20812","arxiv_id":"2510.20812","title":"Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation","abstract":"Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.","short_abstract":"Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense l...","url_abs":"https://arxiv.org/abs/2510.20812","url_pdf":"https://arxiv.org/pdf/2510.20812v4","authors":"[\"Yuhan Liu\",\"Lianhui Qin\",\"Shengjie Wang\"]","published":"2025-10-23T17:59:21Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":607889,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2851329,"paper_url":"https://arxiv.org/abs/2510.20812","paper_title":"Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation","repo_url":"https://github.com/Tinaliu0123/speculative-verdict","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
