{"ID":2895457,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.13374","arxiv_id":"2507.13374","title":"Smart Routing for Multimodal Video Retrieval: When to Search What","abstract":"We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9% Recall@5, they require expensive offline processing and miss critical visual information present in 34% of clips with scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (text), and visual indices, averaging 1.78 modalities per query versus exhaustive 3.0 modality search. Evaluation on 1.8M video clips demonstrates that intelligent routing provides a practical solution for scaling multimodal retrieval systems, reducing infrastructure costs while maintaining competitive effectiveness for real-world deployment.","short_abstract":"We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9% Recall@5, they require expensive offline processing and miss critical visual information present in 34% of clips with scene text not cap...","url_abs":"https://arxiv.org/abs/2507.13374","url_pdf":"https://arxiv.org/pdf/2507.13374v1","authors":"[\"Kevin Dela Rosa\"]","published":"2025-07-12T15:45:03Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.IR\"]","methods":"[\"Large Language Model\"]","has_code":false}
