{"ID":2886247,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03266","arxiv_id":"2508.03266","title":"EgoPrompt: Prompt Learning for Egocentric Action Recognition","abstract":"Driven by the increasing demand for applications in augmented and virtual reality, egocentric action recognition has emerged as a prominent research area. It is typically divided into two subtasks: recognizing the performed behavior (i.e., verb component) and identifying the objects being acted upon (i.e., noun component) from the first-person perspective. However, most existing approaches treat these two components as independent classification tasks, focusing on extracting component-specific knowledge while overlooking their inherent semantic and contextual relationships, leading to fragmented representations and sub-optimal generalization capability. To address these challenges, we propose a prompt learning-based framework, EgoPrompt, to conduct the egocentric action recognition task. Building on the existing prompting strategy to capture the component-specific knowledge, we construct a Unified Prompt Pool space to establish interaction between the two types of component representations. Specifically, the component representations (from verbs and nouns) are first decomposed into fine-grained patterns with the prompt pair form. Then, these pattern-level representations are fused through an attention-based mechanism to facilitate cross-component interaction. To ensure the prompt pool is informative, we further introduce a novel training objective, Diverse Pool Criteria. This objective realizes our goals from two perspectives: Prompt Selection Frequency Regularization and Prompt Knowledge Orthogonalization. Extensive experiments are conducted on the Ego4D, EPIC-Kitchens, and EGTEA datasets. The results consistently show that EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.","short_abstract":"Driven by the increasing demand for applications in augmented and virtual reality, egocentric action recognition has emerged as a prominent research area. It is typically divided into two subtasks: recognizing the performed behavior (i.e., verb component) and identifying the objects being acted upon (i.e., noun compone...","url_abs":"https://arxiv.org/abs/2508.03266","url_pdf":"https://arxiv.org/pdf/2508.03266v2","authors":"[\"Huaihai Lyu\",\"Chaofan Chen\",\"Yuheng Ji\",\"Changsheng Xu\"]","published":"2025-08-05T09:47:07Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}
