{"ID":2826275,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.19400","arxiv_id":"2512.19400","title":"Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara","abstract":"We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetuned Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47\\% to 37.12\\% on one and from 36.07\\% to 32.33\\% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.","short_abstract":"We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetuned Parake...","url_abs":"https://arxiv.org/abs/2512.19400","url_pdf":"https://arxiv.org/pdf/2512.19400v1","authors":"[\"Yacouba Diarra\",\"Panga Azazia Kamate\",\"Nouhoum Souleymane Coulibaly\",\"Michael Leventhal\"]","published":"2025-12-22T13:52:33Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
