{"ID":2879549,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.16576","arxiv_id":"2508.16576","title":"Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet","abstract":"Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. Our results show that SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. We also analyze model scaling, finding consistent improvements up to 1B parameters, beyond which performance plateaus. Additionally, age-related ASR and speaker verification analysis highlights the limitations of proprietary models like Whisper, emphasizing the need for open-data models for reliable child speech research. All investigations are conducted using ESPnet, and our publicly available benchmark provides insights into training strategies for robust child speech processing.","short_abstract":"Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. \nWe compare flat-start training across multiple datasets, SSL represen...","url_abs":"https://arxiv.org/abs/2508.16576","url_pdf":"https://arxiv.org/pdf/2508.16576v1","authors":"[\"Anyu Ying\",\"Natarajan Balaji Shankar\",\"Chyi-Jiunn Lin\",\"Mohan Shi\",\"Pu Wang\",\"Hye-jin Shim\",\"Siddhant Arora\",\"Hugo Van hamme\",\"Abeer Alwan\",\"Shinji Watanabe\"]","published":"2025-08-22T17:59:35Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[]","has_code":false}
