{"ID":2856865,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10740","arxiv_id":"2510.10740","title":"Dual Data Scaling for Robust Two-Stage User-Defined Keyword Spotting","abstract":"In this paper, we propose DS-KWS, a two-stage framework for robust user-defined keyword spotting. It combines a CTC-based method with a streaming phoneme search module to locate candidate segments, followed by a QbyT-based method with a phoneme matcher module for verification at both the phoneme and utterance levels. To further improve performance, we introduce a dual data scaling strategy: (1) expanding the ASR corpus from 460 to 1,460 hours to strengthen the acoustic model; and (2) leveraging over 155k anchor classes to train the phoneme matcher, significantly enhancing the distinction of confusable words. Experiments on LibriPhrase show that DS-KWS significantly outperforms existing methods, achieving 6.13% EER and 97.85% AUC on the Hard subset. On Hey-Snips, it achieves zero-shot performance comparable to full-shot trained models, reaching 99.13% recall at one false alarm per hour.","short_abstract":"In this paper, we propose DS-KWS, a two-stage framework for robust user-defined keyword spotting. It combines a CTC-based method with a streaming phoneme search module to locate candidate segments, followed by a QbyT-based method with a phoneme matcher module for verification at both the phoneme and utterance levels. T...","url_abs":"https://arxiv.org/abs/2510.10740","url_pdf":"https://arxiv.org/pdf/2510.10740v1","authors":"[\"Zhiqi Ai\",\"Han Cheng\",\"Yuxin Wang\",\"Shiyi Mu\",\"Shugong Xu\",\"Yongjin Zhou\"]","published":"2025-10-12T18:25:55Z","proceeding":"cs.SD","tasks":"[\"cs.SD\"]","methods":"[]","has_code":false}