{"ID":2826463,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.18563","arxiv_id":"2512.18563","title":"OpenView: Empowering MLLMs with Out-of-view VQA","abstract":"Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet, they perform well, mainly on reasoning in-view contents within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline to massively generate multi-choice VQA by leveraging panoramic imagery to enable context-rich and spatial-grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset from diverse real-world panoramas to empower MLLMs upon supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that despite having a large gap from human performance in OOV VQA answer selection, upon empowered by OpenView, multiple MLLMs can consistently boost their performance, uplifted from 48.6% to 64.1% on average. Code, benchmark, and data will be available at https://github.com/q1xiangchen/OpenView.","short_abstract":"Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet, they perform well, mainly on reasoning in-view contents within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason objects, activities, and scenes be...","url_abs":"https://arxiv.org/abs/2512.18563","url_pdf":"https://arxiv.org/pdf/2512.18563v1","authors":"[\"Qixiang Chen\",\"Cheng Zhang\",\"Chi-Wing Fu\",\"Jingwen Ye\",\"Jianfei Cai\"]","published":"2025-12-21T02:11:40Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":605744,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2826463,"paper_url":"https://arxiv.org/abs/2512.18563","paper_title":"OpenView: Empowering MLLMs with Out-of-view VQA","repo_url":"https://github.com/q1xiangchen/OpenView","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
