{"ID":2860483,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.03639","arxiv_id":"2510.03639","title":"Towards Unsupervised Speech Recognition at the Syllable-Level","abstract":"Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.","short_abstract":"Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely...","url_abs":"https://arxiv.org/abs/2510.03639","url_pdf":"https://arxiv.org/pdf/2510.03639v1","authors":"[\"Liming Wang\",\"Junrui Ni\",\"Kai-Wei Chang\",\"Saurabhchand Bhati\",\"David Harwath\",\"Mark Hasegawa-Johnson\",\"James R. Glass\"]","published":"2025-10-04T02:56:33Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Language Model\",\"Generative Adversarial Network\"]","has_code":false}