{"ID":2921648,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T05:56:00.181519634Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01113","arxiv_id":"2606.01113","title":"R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking","abstract":"The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present $\\mathbb{R}^3$, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with the base composed query through an agreement-gated residual rule. Finally, a reranker verifies the recalled candidates with direct source-candidate comparison. Experiments demonstrate the effectiveness of our method on this challenge. 
Code is available at https://github.com/Lee-zixu/R-3.","short_abstract":"The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the tran...","url_abs":"https://arxiv.org/abs/2606.01113","url_pdf":"https://arxiv.org/pdf/2606.01113v1","authors":"[\"Zixu Li\",\"Yupeng Hu\",\"Zhiheng Fu\",\"Zhiwei Chen\",\"Weili Guan\",\"Liqiang Nie\"]","published":"2026-05-31T09:20:53Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":612590,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-02T02:42:49.606572591Z","DeletedAt":null,"paper_id":2921648,"paper_url":"https://arxiv.org/abs/2606.01113","paper_title":"R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking","repo_url":"https://github.com/Lee-zixu/R-3","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
