{"ID":2876905,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.00230","arxiv_id":"2509.00230","title":"Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks","abstract":"This study evaluates the performance of three advanced speech encoder models, Wav2Vec 2.0, XLS-R, and Whisper, in speaker identification tasks. By fine-tuning these models and analyzing their layer-wise representations using SVCCA, k-means clustering, and t-SNE visualizations, we found that Wav2Vec 2.0 and XLS-R capture speaker-specific features effectively in their early layers, with fine-tuning improving stability and performance. Whisper showed better performance in deeper layers. Additionally, we determined the optimal number of transformer layers for each model when fine-tuned for speaker identification tasks.","short_abstract":"This study evaluates the performance of three advanced speech encoder models, Wav2Vec 2.0, XLS-R, and Whisper, in speaker identification tasks. By fine-tuning these models and analyzing their layer-wise representations using SVCCA, k-means clustering, and t-SNE visualizations, we found that Wav2Vec 2.0 and XLS-R captur...","url_abs":"https://arxiv.org/abs/2509.00230","url_pdf":"https://arxiv.org/pdf/2509.00230v2","authors":"[\"Linus Stuhlmann\",\"Michael Alexander Saxer\"]","published":"2025-08-29T20:39:42Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.CL\",\"cs.LG\",\"eess.AS\"]","methods":"[\"Transformer\"]","has_code":false}
