{"ID":2888980,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.21395","arxiv_id":"2507.21395","title":"Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion","abstract":"Multimodal emotion recognition (MER) is crucial for enabling emotionally intelligent systems that perceive and respond to human emotions. However, existing methods suffer from limited cross-modal interaction and imbalanced contributions across modalities. To address these issues, we propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion. Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features. A cross-attention fusion mechanism further aligns multimodal cues for robust emotion inference. Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.","short_abstract":"Multimodal emotion recognition (MER) is crucial for enabling emotionally intelligent systems that perceive and respond to human emotions. However, existing methods suffer from limited cross-modal interaction and imbalanced contributions across modalities. To address these issues, we propose Sync-TVA, an end-to-end grap...","url_abs":"https://arxiv.org/abs/2507.21395","url_pdf":"https://arxiv.org/pdf/2507.21395v1","authors":"[\"Zeyu Deng\",\"Yanhui Lu\",\"Jiashu Liao\",\"Shuang Wu\",\"Chongfeng Wei\"]","published":"2025-07-29T00:03:28Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.AI\",\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}
