{"ID":2831479,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.07136","arxiv_id":"2512.07136","title":"A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning","abstract":"Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data label (discrete category) and (2) data caption (textual description). Naively generating captions from labels often lacks logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants across two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that leverages LLMs to generate logically connected activity sequences, followed by human validation. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis. Project page and code: https://openaiotlab.github.io/CUHK-X/ and https://github.com/openaiotlab/CUHK-X.","short_abstract":"Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However,...","url_abs":"https://arxiv.org/abs/2512.07136","url_pdf":"https://arxiv.org/pdf/2512.07136v1","authors":"[\"Siyang Jiang\",\"Mu Yuan\",\"Xiang Ji\",\"Bufang Yang\",\"Zeyu Liu\",\"Lilin Xu\",\"Yang Li\",\"Yuting He\",\"Liran Dong\",\"Wenrui Lu\",\"Zhenyu Yan\",\"Xiaofan Jiang\",\"Wei Gao\",\"Hongkai Chen\",\"Guoliang Xing\"]","published":"2025-12-08T03:40:52Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606128,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2831479,"paper_url":"https://arxiv.org/abs/2512.07136","paper_title":"A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning","repo_url":"https://github.com/openaiotlab/CUHK-X","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}