{"ID":2836706,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.19877","arxiv_id":"2511.19877","title":"It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models","abstract":"Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health.","short_abstract":"Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language unders...","url_abs":"https://arxiv.org/abs/2511.19877","url_pdf":"https://arxiv.org/pdf/2511.19877v2","authors":"[\"Xiangyu Zhao\",\"Yaling Shen\",\"Yiwen Jiang\",\"Zimu Wang\",\"Jiahe Liu\",\"Maxmartwell H Cheng\",\"Guilherme C Oliveira\",\"Robert Desimone\",\"Dominic Dwyer\",\"Zongyuan Ge\"]","published":"2025-11-25T03:38:05Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.CV\",\"cs.LG\",\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}