{"ID":2869574,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15430","arxiv_id":"2509.15430","title":"BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition","abstract":"Speech is a rich signal, and labeled audio-text pairs are costly, making self-supervised learning essential for scalable representation learning. A core challenge in speech SSL is generating pseudo-labels that are both informative and efficient: strong labels, such as those used in HuBERT, improve downstream performance but rely on external encoders and multi-stage pipelines, while efficient methods like BEST-RQ achieve simplicity at the cost of weaker labels. We propose BiRQ, a bilevel SSL framework that combines the efficiency of BEST-RQ with the refinement benefits of HuBERT-style label enhancement. The key idea is to reuse part of the model itself as a pseudo-label generator: intermediate representations are discretized by a random-projection quantizer to produce enhanced labels, while anchoring labels derived directly from the raw input stabilize training and prevent collapse. Training is formulated as an efficient first-order bilevel optimization problem, solved end-to-end with differentiable Gumbel-softmax selection. This design eliminates the need for external label encoders, reduces memory cost, and enables iterative label refinement in an end-to-end fashion. BiRQ consistently improves over BEST-RQ while maintaining low complexity and computational efficiency. We validate our method on various datasets, including 960-hour LibriSpeech, 150-hour AMI meetings and 5,000-hour YODAS, demonstrating consistent gains over BEST-RQ.","short_abstract":"Speech is a rich signal, and labeled audio-text pairs are costly, making self-supervised learning essential for scalable representation learning. A core challenge in speech SSL is generating pseudo-labels that are both informative and efficient: strong labels, such as those used in HuBERT, improve downstream performanc...","url_abs":"https://arxiv.org/abs/2509.15430","url_pdf":"https://arxiv.org/pdf/2509.15430v1","authors":"[\"Liuyuan Jiang\",\"Xiaodong Cui\",\"Brian Kingsbury\",\"Tianyi Chen\",\"Lisha Chen\"]","published":"2025-09-18T21:09:29Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}
