{"ID":2892904,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.14743","arxiv_id":"2507.14743","title":"InterAct-Video: Reasoning-Rich Video QA for Urban Traffic","abstract":"Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA","short_abstract":"Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle...","url_abs":"https://arxiv.org/abs/2507.14743","url_pdf":"https://arxiv.org/pdf/2507.14743v3","authors":"[\"Joseph Raj Vishal\",\"Divesh Basina\",\"Rutuja Patil\",\"Manas Srinivas Gowda\",\"Katha Naik\",\"Yezhou Yang\",\"Bharatesh Chakravarthi\"]","published":"2025-07-19T20:30:43Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":612026,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2892904,"paper_url":"https://arxiv.org/abs/2507.14743","paper_title":"InterAct-Video: Reasoning-Rich Video QA for Urban Traffic","repo_url":"https://github.com/joe-rabbit/InterAct_VideoQA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}