{"ID":2885798,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.04453","arxiv_id":"2508.04453","title":"Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion","abstract":"Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial and error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at https://github.com/XMUDeepLIT/CVC.","short_abstract":"Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. 
However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuni...","url_abs":"https://arxiv.org/abs/2508.04453","url_pdf":"https://arxiv.org/pdf/2508.04453v1","authors":"[\"Qingguo Hu\",\"Ante Wang\",\"Jia Song\",\"Delai Qiu\",\"Qingsong Liu\",\"Jinsong Su\"]","published":"2025-08-06T13:54:49Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":611242,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2885798,"paper_url":"https://arxiv.org/abs/2508.04453","paper_title":"Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion","repo_url":"https://github.com/XMUDeepLIT/CVC","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
