{"ID":3083928,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T05:32:54.120957816Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05931","arxiv_id":"2606.05931","title":"To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection","abstract":"When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).","short_abstract":"When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimo...","url_abs":"https://arxiv.org/abs/2606.05931","url_pdf":"https://arxiv.org/pdf/2606.05931v1","authors":"[\"Erfan Loweimi\",\"Mengjie Qian\",\"Kate Knill\",\"Guanfeng Wu\",\"Chi-Ho Chan\",\"Abbas Haider\",\"Muhammad Awan\",\"Josef Kittler\",\"Hui Wang\",\"Mark Gales\"]","published":"2026-06-04T09:33:58Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.CV\",\"cs.IR\",\"cs.LG\",\"cs.MM\",\"eess.AS\"]","methods":"[]","has_code":false}
