{"ID":2892193,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.15491","arxiv_id":"2507.15491","title":"Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval","abstract":"Enabling efficient text-video retrieval on edge-end devices is critical for real-world applications. Yet, existing methods face a critical challenge in balancing accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame sampling methods reduce overhead but suffer from query-agnostic frame selection that biases retrieval results. To address this, we propose ProCLIP, a user-centric framework that achieves state-of-the-art accuracy with significantly improved efficiency. We design a prompt-aware frame sampling strategy that dynamically guides lightweight feature extractors using textual prompts to select semantically relevant frames, overcoming the limitations of existing salient-frame sampling methods which rely on static, query-agnostic selection criteria. Moreover, we adopt a two-stage candidate pruning strategy that combines rapid coarse filtering via a lightweight module with CLIP-powered fine-grained re-ranking, enhancing retrieval efficiency while preserving accuracy. Experiments across benchmarks show ProCLIP achieves 75.3% latency reduction versus baselines while maintaining competitive accuracy, i.e., R@1=49.0 in MSR-VTT dataset. Code is available at https://github.com/tiffylong/ProCLIP.","short_abstract":"Enabling efficient text-video retrieval on edge-end devices is critical for real-world applications. Yet, existing methods face a critical challenge in balancing accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame s...","url_abs":"https://arxiv.org/abs/2507.15491","url_pdf":"https://arxiv.org/pdf/2507.15491v1","authors":"[\"Deyu Zhang\",\"Tingting Long\",\"Jinrui Zhang\",\"Ligeng Chen\",\"Ju Ren\",\"Yaoxue Zhang\"]","published":"2025-07-21T10:46:49Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":611965,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2892193,"paper_url":"https://arxiv.org/abs/2507.15491","paper_title":"Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval","repo_url":"https://github.com/tiffylong/ProCLIP","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
