{"ID":3084873,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T05:49:02.101151534Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05739","arxiv_id":"2606.05739","title":"Do speech foundation models perceive speaker similarity as humans do?","abstract":"This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners have the ability to judge speaker similarity on a continuous scale discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.","short_abstract":"This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners have the ability to judge speaker similarity on a continuous scale discerning how similar two voices are. In contrast, speech foundation models emb...","url_abs":"https://arxiv.org/abs/2606.05739","url_pdf":"https://arxiv.org/pdf/2606.05739v1","authors":"[\"Minoru Kishi\",\"Hayato Yagi\",\"Shinnosuke Takamichi\",\"Yuki Saito\"]","published":"2026-06-04T06:04:18Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}
