{"ID":3083567,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T06:54:00.442624098Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.06444","arxiv_id":"2606.06444","title":"USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding","abstract":"Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.","short_abstract":"Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in...","url_abs":"https://arxiv.org/abs/2606.06444","url_pdf":"https://arxiv.org/pdf/2606.06444v1","authors":"[\"Heng-Jui Chang\",\"Alexander H. Liu\",\"Saurabhchand Bhati\",\"Mrudula Athi\",\"Anton Ratnarajah\",\"Amit Chhetri\",\"James Glass\"]","published":"2026-06-04T17:42:05Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.CL\",\"cs.SD\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
