{"ID":2891518,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.17918","arxiv_id":"2507.17918","title":"One Whisper to Grade Them All","abstract":"We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak \u0026 Improve Challenge. Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.","short_abstract":"We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak \u0026 Improve Challenge. Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information v...","url_abs":"https://arxiv.org/abs/2507.17918","url_pdf":"https://arxiv.org/pdf/2507.17918v1","authors":"[\"Nhan Phan\",\"Anusha Porwal\",\"Yaroslav Getman\",\"Ekaterina Voskoboinik\",\"Tamás Grósz\",\"Mikko Kurimo\"]","published":"2025-07-23T20:31:40Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"eess.AS\"]","methods":"[]","has_code":false}
