{"ID":2827494,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.16395","arxiv_id":"2512.16395","title":"BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection","abstract":"Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) involves retrieving matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but struggle with robustness to noise and reverberation, and with inefficient token utilization. We address these challenges by proposing a noise and reverberation-augmented training strategy to improve tokenizer robustness. In addition, we introduce optimal transport-based regularization to ensure balanced token usage and enhance token efficiency. To further speed up retrieval, we adopt a TF-IDF-based search mechanism. Empirical evaluations demonstrate that the proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.","short_abstract":"Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) involves retrieving matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but strug...","url_abs":"https://arxiv.org/abs/2512.16395","url_pdf":"https://arxiv.org/pdf/2512.16395v2","authors":"[\"Anup Singh\",\"Vipul Arora\",\"Kris Demuynck\"]","published":"2025-12-18T10:48:00Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[]","has_code":false}
