{"ID":2875939,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.01471","arxiv_id":"2509.01471","title":"Hierarchical Motion Captioning Utilizing External Text Data Source","abstract":"This paper introduces a novel approach to enhance existing motion captioning methods, which directly map representations of movement to high-level descriptive captions (e.g., ``a person doing jumping jacks\"). The existing methods require motion data annotated with high-level descriptions (e.g., ``jumping jacks\"). However, such data is rarely available in existing motion-text datasets, which additionally do not include low-level motion descriptions. To address this, we propose a two-step hierarchical approach. First, we employ large language models to create detailed descriptions corresponding to each high-level caption that appears in the motion-text datasets (e.g., ``jumping while synchronizing arm extensions with the opening and closing of legs\" for ``jumping jacks\"). These refined annotations are used to retrain motion-to-text models to produce captions with low-level details. Second, we introduce a pioneering retrieval-based mechanism. It aligns the detailed low-level captions with candidate high-level captions from additional text data sources, and combine them with motion features to fabricate precise high-level captions. Our methodology is distinctive in its ability to harness knowledge from external text sources to greatly increase motion captioning accuracy, especially for movements not covered in existing motion-text datasets. Experiments on three distinct motion-text datasets (HumanML3D, KIT, and BOTH57M) demonstrate that our method achieves an improvement in average performance (across BLEU-1, BLEU-4, CIDEr, and ROUGE-L) ranging from 6% to 50% compared to the state-of-the-art M2T-Interpretable.","short_abstract":"This paper introduces a novel approach to enhance existing motion captioning methods, which directly map representations of movement to high-level descriptive captions (e.g., ``a person doing jumping jacks\"). The existing methods require motion data annotated with high-level descriptions (e.g., ``jumping jacks\"). Howev...","url_abs":"https://arxiv.org/abs/2509.01471","url_pdf":"https://arxiv.org/pdf/2509.01471v1","authors":"[\"Clayton Leite\",\"Yu Xiao\"]","published":"2025-09-01T13:39:14Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}