{"ID":2853391,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.16444","arxiv_id":"2510.16444","title":"RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba","abstract":"Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises \u003e2.9 million frames and \u003e75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.","short_abstract":"Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is...","url_abs":"https://arxiv.org/abs/2510.16444","url_pdf":"https://arxiv.org/pdf/2510.16444v1","authors":"[\"Kunyu Peng\",\"Di Wen\",\"Jia Fu\",\"Jiamin Wu\",\"Kailun Yang\",\"Junwei Zheng\",\"Ruiping Liu\",\"Yufan Chen\",\"Yuqian Fu\",\"Danda Pani Paudel\",\"Luc Van Gool\",\"Rainer Stiefelhagen\"]","published":"2025-10-18T10:41:19Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.MM\",\"cs.RO\",\"eess.IV\"]","methods":"[]","has_code":false,"code_links":[{"ID":608078,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2853391,"paper_url":"https://arxiv.org/abs/2510.16444","paper_title":"RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba","repo_url":"https://github.com/KPeng9510/refAVA2","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}