{"ID":2890098,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19821","arxiv_id":"2507.19821","title":"LAVA: Language Driven Scalable and Versatile Traffic Video Analytics","abstract":"In modern urban environments, camera networks generate massive amounts of operational footage -- reaching petabytes each day -- making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. In particular, we build \\textsc{Lava}, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. \\textsc{Lava} comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization; 2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that \\textsc{Lava} improves $F_1$-scores for selection queries by $\\mathbf{14\\%}$, reduces MPAE for aggregation queries by $\\mathbf{0.39}$, and achieves top-$k$ precision of $\\mathbf{86\\%}$, while processing videos $ \\mathbf{9.6\\times} $ faster than the most accurate baseline. 
Our code and dataset are available at https://github.com/yuyanrui/LAVA.","short_abstract":"In modern urban environments, camera networks generate massive amounts of operational footage -- reaching petabytes each day -- making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains...","url_abs":"https://arxiv.org/abs/2507.19821","url_pdf":"https://arxiv.org/pdf/2507.19821v2","authors":"[\"Yanrui Yu\",\"Tianfei Zhou\",\"Jiaxin Sun\",\"Lianpeng Qiao\",\"Lizhong Ding\",\"Ye Yuan\",\"Guoren Wang\"]","published":"2025-07-26T06:38:07Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.MM\"]","methods":"[]","has_code":false,"code_links":[{"ID":611738,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2890098,"paper_url":"https://arxiv.org/abs/2507.19821","paper_title":"LAVA: Language Driven Scalable and Versatile Traffic Video Analytics","repo_url":"https://github.com/yuyanrui/LAVA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
