{"ID":2896574,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.06821","arxiv_id":"2507.06821","title":"HeLo: Heterogeneous Multi-Modal Fusion with Label Correlation for Emotion Distribution Learning","abstract":"Multi-modal emotion recognition has garnered increasing attention as it plays a significant role in human-computer interaction (HCI) in recent years. Since different discrete emotions may exist at the same time, compared with single-class emotion recognition, emotion distribution learning (EDL) that identifies a mixture of basic emotions has gradually emerged as a trend. However, existing EDL methods face challenges in mining the heterogeneity among multiple modalities. Besides, rich semantic correlations across arbitrary basic emotions are not fully exploited. In this paper, we propose a multi-modal emotion distribution learning framework, named HeLo, aimed at fully exploring the heterogeneity and complementary information in multi-modal emotional data and label correlation within mixed basic emotions. Specifically, we first adopt cross-attention to effectively fuse the physiological data. Then, an optimal transport (OT)-based heterogeneity mining module is devised to mine the interaction and heterogeneity between the physiological and behavioral representations. To facilitate label correlation learning, we introduce a learnable label embedding optimized by correlation matrix alignment. Finally, the learnable label embeddings and label correlation matrices are integrated with the multi-modal representations through a novel label correlation-driven cross-attention mechanism for accurate emotion distribution learning. Experimental results on two publicly available datasets demonstrate the superiority of our proposed method in emotion distribution learning.","short_abstract":"Multi-modal emotion recognition has garnered increasing attention as it plays a significant role in human-computer interaction (HCI) in recent years. Since different discrete emotions may exist at the same time, compared with single-class emotion recognition, emotion distribution learning (EDL) that identifies a mixtur...","url_abs":"https://arxiv.org/abs/2507.06821","url_pdf":"https://arxiv.org/pdf/2507.06821v3","authors":"[\"Chuhang Zheng\",\"Chunwei Tian\",\"Jie Wen\",\"Daoqiang Zhang\",\"Qi Zhu\"]","published":"2025-07-09T13:08:58Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.MM\"]","methods":"[]","has_code":false}
