{"ID":2881586,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.11903","arxiv_id":"2508.11903","title":"OVG-HQ: Online Video Grounding with Hybrid-modal Queries","abstract":"The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles with some scenarios like streaming video or queries using visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that retains previously learned knowledge to enhance current decisions and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multi-modal queries. Besides, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@n, IoU=m, and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. Source code and datasets are available at https://github.com/maojiaqi2324/OVG-HQ.","short_abstract":"The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles with some scenarios like streaming video or queries using visual cues. 
To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-H...","url_abs":"https://arxiv.org/abs/2508.11903","url_pdf":"https://arxiv.org/pdf/2508.11903v1","authors":"[\"Runhao Zeng\",\"Jiaqi Mao\",\"Minghao Lai\",\"Minh Hieu Phan\",\"Yanjie Dong\",\"Wei Wang\",\"Qi Chen\",\"Xiping Hu\"]","published":"2025-08-16T04:21:45Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":610826,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2881586,"paper_url":"https://arxiv.org/abs/2508.11903","paper_title":"OVG-HQ: Online Video Grounding with Hybrid-modal Queries","repo_url":"https://github.com/maojiaqi2324/OVG-HQ","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}