{"ID":2839392,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.15145","arxiv_id":"2511.15145","title":"Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding","abstract":"Human voice encodes both identity and paralinguistic cues, yet encoders in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.","short_abstract":"Human voice encodes both identity and paralinguistic cues, yet encoders in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task train...","url_abs":"https://arxiv.org/abs/2511.15145","url_pdf":"https://arxiv.org/pdf/2511.15145v1","authors":"[\"Mingyue Huo\",\"Wei-Cheng Tseng\",\"Yiwen Shao\",\"Hao Zhang\",\"Dong Yu\"]","published":"2025-11-19T05:53:34Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.SD\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}