{"ID":2867834,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.18004","arxiv_id":"2509.18004","title":"WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing","abstract":"The scarcity of large-scale, open-source data for dialects severely hinders progress in speech technology, a challenge particularly acute for the widely spoken Sichuanese dialects of Chinese. To address this critical gap, we introduce WenetSpeech-Chuan, a 10,000-hour, richly annotated corpus constructed using our novel Chuan-Pipeline, a complete data processing framework for dialectal speech. To facilitate rigorous evaluation and demonstrate the corpus's effectiveness, we also release high-quality ASR and TTS benchmarks, WenetSpeech-Chuan-Eval, with manually verified transcriptions. Experiments show that models trained on WenetSpeech-Chuan achieve state-of-the-art performance among open-source systems and demonstrate results comparable to commercial services. As the largest open-source corpus for Sichuanese dialects, WenetSpeech-Chuan not only lowers the barrier to research in dialectal speech processing but also plays a crucial role in promoting AI equity and mitigating bias in speech technologies. The corpus, benchmarks, models, and receipts are publicly available on our project page.","short_abstract":"The scarcity of large-scale, open-source data for dialects severely hinders progress in speech technology, a challenge particularly acute for the widely spoken Sichuanese dialects of Chinese. \nTo address this critical gap, we introduce WenetSpeech-Chuan, a 10,000-hour, richly annotated corpus constructed using our novel...","url_abs":"https://arxiv.org/abs/2509.18004","url_pdf":"https://arxiv.org/pdf/2509.18004v1","authors":"[\"Yuhang Dai\",\"Ziyu Zhang\",\"Shuai Wang\",\"Longhao Li\",\"Zhao Guo\",\"Tianlun Zuo\",\"Shuiyuan Wang\",\"Hongfei Xue\",\"Chengyou Wang\",\"Qing Wang\",\"Xin Xu\",\"Hui Bu\",\"Jie Li\",\"Jian Kang\",\"Binbin Zhang\",\"Lei Xie\"]","published":"2025-09-22T16:44:00Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.SD\"]","methods":"[]","has_code":false}
