{"ID":2840059,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.17618","arxiv_id":"2511.17618","title":"Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach","abstract":"Conventional VQA approaches primarily rely on question-answer (Q\u0026A) pairs to learn the spatio-temporal dynamics of video content. However, most existing annotations are event-centric, which restricts the model's ability to capture the comprehensive context of a scene. The lack of fundamental information such as object categories, spatial configurations, and descriptive visual attributes prevents the model from forming a complete understanding of the environment, ultimately limiting its generalization and reasoning capability. In this paper, we introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. FIQ generates Q\u0026A pairs from descriptive information extracted directly from videos, thereby enriching the dataset with core scene-level attributes. These generated pairs help the model develop a more holistic understanding of the video, leading to improved generalizability and reasoning performance. In addition, we propose a VQ-CAlign module that aligns task-specific question embeddings with corresponding visual features, preserving essential contextual cues and enhancing adaptability to downstream tasks. Experimental results on the SUTD-TrafficQA dataset demonstrate that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.","short_abstract":"Conventional VQA approaches primarily rely on question-answer (Q\u0026A) pairs to learn the spatio-temporal dynamics of video content. However, most existing annotations are event-centric, which restricts the model's ability to capture the comprehensive context of a scene. The lack of fundamental information such as object...","url_abs":"https://arxiv.org/abs/2511.17618","url_pdf":"https://arxiv.org/pdf/2511.17618v1","authors":"[\"Ju-Young Oh\"]","published":"2025-11-18T13:45:50Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}
