Age-Aware Adapter Tuning for Children's Speech Recognition
Abstract
Children's automatic speech recognition (ASR) remains challenging because child speech differs from adult speech and varies substantially across developmental stages. While adapter tuning provides a promising way to adapt large pretrained ASR models to children's speech, a single shared child adapter may not fully capture age-dependent variation. In this work, we present one of the first systematic studies of age-aware adapter tuning for child ASR, focusing on speech from children aged 3--12 years and older. We propose age-specialized adapters trained separately for different age groups and compare them with a unified age-conditioned FiLM adapter. With ground-truth age routing, age-specialized adapters reduce overall word error rate (WER) from 12.6% to 12.3% and macro WER from 18.4% to 17.6% relative to the standard shared child adapter baseline, and they lower WER in every age group. We further show that predicted-age routing remains close to ground-truth routing, achieving 12.3% overall WER and 17.8% macro WER without ground-truth age labels at inference. In contrast, unified FiLM conditioning provides smaller gains, indicating that a single unified adapter may be insufficient to capture developmental variation in child speech.
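The two metrics reported above can be illustrated with a minimal sketch. It assumes that overall WER pools word errors over all utterances and that macro WER is the unweighted mean of per-age-group WERs; the abstract does not spell out these definitions. The function names, the age-group labels, and the toy transcripts are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def word_errors(ref, hyp):
    """Return (edit distance over words, number of reference words)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances for an empty reference prefix
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution or match
        prev = cur
    return prev[-1], len(r)


def overall_and_macro_wer(samples):
    """samples: iterable of (age_group, reference, hypothesis) triples.

    Assumption: every age group has at least one reference word.
    """
    errs, words = defaultdict(int), defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errs[group] += e
        words[group] += n
    # Overall WER: pooled errors over pooled words, so large groups dominate.
    overall = sum(errs.values()) / sum(words.values())
    # Macro WER (assumed definition): each age group counts equally.
    per_group = {g: errs[g] / words[g] for g in errs}
    macro = sum(per_group.values()) / len(per_group)
    return overall, macro, per_group


# Toy data: a small, hard group and a large, easy group.
samples = [
    ("3-5", "a b", "a c"),
    ("9-12", "a b c d e f g h", "a b c d e f g h"),
]
overall, macro, per_group = overall_and_macro_wer(samples)
```

Under these assumed definitions, a macro WER well above the overall WER, as in the reported 18.4% versus 12.6%, would be consistent with the age groups that have higher error rates contributing fewer words to the pooled total.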