{"ID":2863415,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24445","arxiv_id":"2509.24445","title":"Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA","abstract":"The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This \"bag-of-facts\" approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video's existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video's event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5% on STAR (+4.9%) and a 7B model to 80.8% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.","short_abstract":"The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This \"bag-of-facts\" approach fails to capture the underlying narrative and causal structure of events, limiting models to a s...","url_abs":"https://arxiv.org/abs/2509.24445","url_pdf":"https://arxiv.org/pdf/2509.24445v1","authors":"[\"Jianxin Liang\",\"Tan Yue\",\"Yuxuan Wang\",\"Yueqian Wang\",\"Zhihan Yin\",\"Huishuai Zhang\",\"Dongyan Zhao\"]","published":"2025-09-29T08:28:44Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[]","has_code":false}
