{"ID":2883285,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.08962","arxiv_id":"2508.08962","title":"Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech","abstract":"Self-supervised learning (SSL) models like Wav2Vec2, HuBERT, and WavLM have been widely used in speech processing. These transformer-based models consist of multiple layers, each capturing different levels of representation. While prior studies explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean-opinion-score (MOS). Features from each layer are fed into a lightweight regression network to assess effectiveness. Our experiments consistently show early-layers features outperform or match those from the last layer, leading to significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, offering enhanced performance and reduced system complexity.","short_abstract":"Self-supervised learning (SSL) models like Wav2Vec2, HuBERT, and WavLM have been widely used in speech processing. These transformer-based models consist of multiple layers, each capturing different levels of representation. While prior studies explored their layer-wise representations for efficiency and performance, s...","url_abs":"https://arxiv.org/abs/2508.08962","url_pdf":"https://arxiv.org/pdf/2508.08962v1","authors":"[\"Xinyu Liang\",\"Fredrik Cumlin\",\"Victor Ungureanu\",\"Chandan K. A. Reddy\",\"Christian Schuldt\",\"Saikat Chatterjee\"]","published":"2025-08-12T14:25:55Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.SD\"]","methods":"[\"Transformer\"]","has_code":false}
