{"ID":2858462,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08818","arxiv_id":"2510.08818","title":"D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition","abstract":"Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.","short_abstract":"Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of...","url_abs":"https://arxiv.org/abs/2510.08818","url_pdf":"https://arxiv.org/pdf/2510.08818v1","authors":"[\"Yiyang Huang\",\"Yizhou Wang\",\"Yun Fu\"]","published":"2025-10-09T21:08:32Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608545,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2858462,"paper_url":"https://arxiv.org/abs/2510.08818","paper_title":"D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition","repo_url":"https://github.com/hukcc/D-CoDe","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
