{"ID":2890290,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18897","arxiv_id":"2507.18897","title":"HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling","abstract":"Discrete speech tokenization is a fundamental component in speech codecs. However, in large-scale speech-to-speech systems, the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs pose significant challenges. In this paper, we introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio while relying on single-quantizer inference. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. Building on this, we propose an asymmetric encoder-decoder architecture (Audio-VQ-Mel-Audio) that leverages dual supervision and progressive training to enhance reconstruction stability and fidelity. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps. We further evaluate its effectiveness in codebook utilization and generative model adaptation, with extensive ablations validating the necessity of each module. HH-Codec is available at https://github.com/opendilab/HH-Codec.","short_abstract":"Discrete speech tokenization is a fundamental component in speech codecs. However, in large-scale speech-to-speech systems, the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs pose significant challenges. \nIn this paper, we introduce HH-Codec, a neural c...","url_abs":"https://arxiv.org/abs/2507.18897","url_pdf":"https://arxiv.org/pdf/2507.18897v1","authors":"[\"Rongkun Xue\",\"Yazhe Niu\",\"Shuai Hu\",\"Zixin Yin\",\"Yongqiang Yao\",\"Jing Yang\"]","published":"2025-07-25T02:44:30Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":611759,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2890290,"paper_url":"https://arxiv.org/abs/2507.18897","paper_title":"HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling","repo_url":"https://github.com/opendilab/HH-Codec","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}