{"ID":2898000,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.04554","arxiv_id":"2507.04554","title":"Self-supervised learning of speech representations with Dutch archival data","abstract":"This paper explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0. We first study data quality assumptions for pre-training, and show how music, noise and speaker overlap affect SSL convergence and downstream fine-tuning performance. Secondly, we explore effective pre-processing strategies to convert the noisy broadcast dataset into a high-quality dataset for pre-training, by using Whisper and WhisperX. Thirdly, we compare mono-lingual and multi-lingual pre-training with equivalent amounts of data, and show that mono-lingual pre-training is more robust to out-of-domain data. Lastly, we achieve a state-of-the-art LARGE wav2vec 2.0 model for the Dutch language, by a continuation of pre-training a wav2vec 2.0 XLS-R model checkpoint with our 55k hour archival dataset.","short_abstract":"This paper explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0. We first study data quality assumptions for pre-training, and show how music, noise and speaker overlap affect SSL convergence and downstream fine-tuning performanc...","url_abs":"https://arxiv.org/abs/2507.04554","url_pdf":"https://arxiv.org/pdf/2507.04554v2","authors":"[\"Nik Vaessen\",\"Roeland Ordelman\",\"David A. van Leeuwen\"]","published":"2025-07-06T22:11:22Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.CL\",\"cs.LG\",\"eess.AS\"]","methods":"[]","has_code":false}
