{"ID":2880818,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.14039","arxiv_id":"2508.14039","title":"Beyond Simple Edits: Composed Video Retrieval with Dense Modifications","abstract":"Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3\\% Recall@1 in the visual+text setting and outperforms the state-of-the-art by 3.4\\%, highlighting its efficacy in terms of leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: https://github.com/OmkarThawakar/BSE-CoVR","short_abstract":"Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. 
Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understandi...","url_abs":"https://arxiv.org/abs/2508.14039","url_pdf":"https://arxiv.org/pdf/2508.14039v1","authors":"[\"Omkar Thawakar\",\"Dmitry Demidov\",\"Ritesh Thawkar\",\"Rao Muhammad Anwer\",\"Mubarak Shah\",\"Fahad Shahbaz Khan\",\"Salman Khan\"]","published":"2025-08-19T17:59:39Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":610730,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2880818,"paper_url":"https://arxiv.org/abs/2508.14039","paper_title":"Beyond Simple Edits: Composed Video Retrieval with Dense Modifications","repo_url":"https://github.com/OmkarThawakar/BSE-CoVR","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
