{"ID":2869963,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14161","arxiv_id":"2509.14161","title":"CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset","abstract":"We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.","short_abstract":"We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. 
CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real vo...","url_abs":"https://arxiv.org/abs/2509.14161","url_pdf":"https://arxiv.org/pdf/2509.14161v1","authors":"[\"Brian Yan\",\"Injy Hamed\",\"Shuichiro Shimizu\",\"Vasista Lodagala\",\"William Chen\",\"Olga Iakovenko\",\"Bashar Talafha\",\"Amir Hussein\",\"Alexander Polok\",\"Kalvin Chang\",\"Dominik Klement\",\"Sara Althubaiti\",\"Puyuan Peng\",\"Matthew Wiesner\",\"Thamar Solorio\",\"Ahmed Ali\",\"Sanjeev Khudanpur\",\"Shinji Watanabe\",\"Chih-Chen Chen\",\"Zhen Wu\",\"Karim Benharrak\",\"Anuj Diwan\",\"Samuele Cornell\",\"Eunjung Yeo\",\"Kwanghee Choi\",\"Carlos Carvalho\",\"Karen Rosero\"]","published":"2025-09-17T16:45:22Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}
