{"ID":2883040,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.08590","arxiv_id":"2508.08590","title":"QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection","abstract":"Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: Randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is \\textbf{ACTOR} (\\textbf{A}ction-aware \\textbf{C}ross-modal \\textbf{T}ransf\\textbf{OR}mer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a \\textbf{P}erceptual \\textbf{D}istilled \\textbf{Q}uery \\textbf{D}ecoder (\\textbf{PDQD}), which distills object category awareness from a pre-trained detector to serve as object query initiation. This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on HICO-Det and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization. Code will be released upon publication.","short_abstract":"Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: Randomly initialized queries lack explicit semantics, leading...","url_abs":"https://arxiv.org/abs/2508.08590","url_pdf":"https://arxiv.org/pdf/2508.08590v1","authors":"[\"Yuxiao Wang\",\"Wolin Liang\",\"Yu Lei\",\"Weiying Xue\",\"Nan Zhuang\",\"Qi Liu\"]","published":"2025-08-12T03:11:16Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.HC\"]","methods":"[\"Transformer\"]","has_code":false}
