{"ID":2877223,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.20869","arxiv_id":"2508.20869","title":"OLMoASR: Open Models and Data for Training Robust Speech Recognition Models","abstract":"Improvements in training data scale and quality have led to significant advances, yet its influence in speech recognition remains underexplored. In this paper, we present a large-scale dataset, OLMoASR-Pool, and series of models, OLMoASR, to study and develop robust zero-shot speech recognition models. Beginning from OLMoASR-Pool, a collection of 3M hours of English audio and 17M transcripts, we design text heuristic filters to remove low-quality or mistranscribed data. Our curation pipeline produces a new dataset containing 1M hours of high-quality audio-transcript pairs, which we call OLMoASR-Mix. We use OLMoASR-Mix to train the OLMoASR-Mix suite of models, ranging from 39M (tiny.en) to 1.5B (large.en) parameters. Across all model scales, OLMoASR achieves comparable average performance to OpenAI's Whisper on short and long-form speech recognition benchmarks. Notably, OLMoASR-medium.en attains a 12.8\\% and 11.0\\% word error rate (WER) that is on par with Whisper's largest English-only model Whisper-medium.en's 12.4\\% and 10.5\\% WER for short and long-form recognition respectively (at equivalent parameter count). OLMoASR-Pool, OLMoASR models, and filtering, training and evaluation code will be made publicly available to further research on robust speech processing.","short_abstract":"Improvements in training data scale and quality have led to significant advances, yet its influence in speech recognition remains underexplored. In this paper, we present a large-scale dataset, OLMoASR-Pool, and series of models, OLMoASR, to study and develop robust zero-shot speech recognition models. Beginning from O...","url_abs":"https://arxiv.org/abs/2508.20869","url_pdf":"https://arxiv.org/pdf/2508.20869v1","authors":"[\"Huong Ngo\",\"Matt Deitke\",\"Martijn Bartelds\",\"Sarah Pratt\",\"Josh Gardner\",\"Matt Jordan\",\"Ludwig Schmidt\"]","published":"2025-08-28T15:00:51Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.CL\",\"cs.LG\",\"eess.AS\"]","methods":"[]","has_code":false}
