{"ID":2844600,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.05935","arxiv_id":"2511.05935","title":"Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation","abstract":"Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) \\textit{Infusing knowledge} into large-scale models via pre-training on large datasets; 2) \\textit{Transferring knowledge} from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer. To this end, in this paper, we propose an inter\\textbf{AC}tion-\\textbf{C}entric end-to-end OVSGG framework (\\textbf{ACC}) in an interaction-driven paradigm to minimize these mismatches. For \\textit{interaction-centric knowledge infusion}, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model's interaction knowledge. For \\textit{interaction-centric knowledge transfer}, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.","short_abstract":"Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) \\textit{Infusing knowledge} into large-scale mode...","url_abs":"https://arxiv.org/abs/2511.05935","url_pdf":"https://arxiv.org/pdf/2511.05935v1","authors":"[\"Lin Li\",\"Chuhan Zhang\",\"Dong Zhang\",\"Chong Sun\",\"Chen Li\",\"Long Chen\"]","published":"2025-11-08T08:59:09Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}