{"ID":2869087,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14527","arxiv_id":"2509.14527","title":"CLAIP-Emo: Parameter-Efficient Adaptation of Language-supervised models for In-the-Wild Audiovisual Emotion Recognition","abstract":"Audiovisual emotion recognition (AVER) in the wild is still hindered by pose variation, occlusion, and background noise. Prevailing methods primarily rely on large-scale domain-specific pre-training, which is costly and often mismatched to real-world affective data. To address this, we present CLAIP-Emo, a modular framework that reframes in-the-wild AVER as a parameter-efficient adaptation of language-supervised foundation models (CLIP/CLAP). Specifically, it (i) preserves language-supervised priors by freezing CLIP/CLAP backbones and performing emotion-oriented adaptation via LoRA (updating ≤4.0% of the total parameters), (ii) allocates temporal modeling asymmetrically, employing a lightweight Transformer for visual dynamics while applying mean pooling for audio prosody, and (iii) applies a simple fusion head for prediction. On DFEW and MAFW, CLAIP-Emo (ViT-L/14) achieves 80.14% and 61.18% weighted average recall with only 8M training parameters, setting a new state of the art. Our findings suggest that parameter-efficient adaptation of language-supervised foundation models provides a scalable alternative to domain-specific pre-training for real-world AVER. The code and models will be available at https://github.com/MSA-LMC/CLAIP-Emo.","short_abstract":"Audiovisual emotion recognition (AVER) in the wild is still hindered by pose variation, occlusion, and background noise. Prevailing methods primarily rely on large-scale domain-specific pre-training, which is costly and often mismatched to real-world affective data. 
To address this, we present CLAIP-Emo, a modular fram...","url_abs":"https://arxiv.org/abs/2509.14527","url_pdf":"https://arxiv.org/pdf/2509.14527v1","authors":"[\"Yin Chen\",\"Jia Li\",\"Jinpeng Hu\",\"Zhenzhen Hu\",\"Richang Hong\"]","published":"2025-09-18T01:45:44Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.SD\"]","methods":"[\"Transformer\",\"LoRA\"]","has_code":false,"code_links":[{"ID":609648,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2869087,"paper_url":"https://arxiv.org/abs/2509.14527","paper_title":"CLAIP-Emo: Parameter-Efficient Adaptation of Language-supervised models for In-the-Wild Audiovisual Emotion Recognition","repo_url":"https://github.com/MSA-LMC/CLAIP-Emo","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}