{"ID":2825364,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.20938","arxiv_id":"2512.20938","title":"Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies","abstract":"Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable multi- and cross-modal integration capabilities. However, their potential for fine-grained emotion understanding remains systematically underexplored. While open-vocabulary multimodal emotion recognition (MER-OV) has emerged as a promising direction to overcome the limitations of closed emotion sets, no comprehensive evaluation of MLLMs in this context currently exists. To address this, our work presents the first large-scale benchmarking study of MER-OV on the OV-MERD dataset, evaluating 19 mainstream MLLMs, including general-purpose, modality-specialized, and reasoning-enhanced architectures. Through systematic analysis of model reasoning capacity, fusion strategies, contextual utilization, and prompt design, we provide key insights into the capabilities and limitations of current MLLMs for MER-OV. Our evaluation reveals that a two-stage, trimodal (audio, video, and text) fusion approach achieves optimal performance in MER-OV, with video emerging as the most critical modality. We further identify a surprisingly narrow gap between open- and closed-source LLMs. These findings establish essential benchmarks and offer practical guidelines for advancing open-vocabulary and fine-grained affective computing, paving the way for more nuanced and interpretable emotion AI systems. Associated code will be made publicly available upon acceptance.","short_abstract":"Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable multi- and cross-modal integration capabilities. However, their potential for fine-grained emotion understanding remains systematically underexplored. While open-vocabulary multimodal emotion recognition (MER-OV) has emerged as a p...","url_abs":"https://arxiv.org/abs/2512.20938","url_pdf":"https://arxiv.org/pdf/2512.20938v1","authors":"[\"Jing Han\",\"Zhiqiang Gao\",\"Shihao Gao\",\"Jialing Liu\",\"Hongyu Chen\",\"Zixing Zhang\",\"Björn W. Schuller\"]","published":"2025-12-24T04:42:30Z","proceeding":"cs.HC","tasks":"[\"cs.HC\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}