{"ID":2850511,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21209","arxiv_id":"2510.21209","title":"SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain","abstract":"Neural Audio Codecs (NACs) have gained growing attention in recent years as technologies for audio compression and audio representation in speech language models. While mainstream NACs typically require G-level computation and M-level parameters, the performance of lightweight and streaming NACs remains underexplored. This paper proposes SpecTokenizer, a lightweight streaming codec that operates in the compressed spectrum domain. Composed solely of alternating CNN and RNN layers, SpecTokenizer achieves greater efficiency and better representational capability through multi-scale modeling in the compressed spectrum domain. At 4 kbps, the proposed SpecTokenizer matches or exceeds the performance of the codec with the state-of-the-art lightweight architecture while requiring only 20% of its computation and 10% of its parameters. Furthermore, it significantly outperforms that codec when given similar computational and storage resources.","short_abstract":"Neural Audio Codecs (NACs) have gained growing attention in recent years as technologies for audio compression and audio representation in speech language models. While mainstream NACs typically require G-level computation and M-level parameters, the performance of lightweight and streaming NACs remains underexplored....","url_abs":"https://arxiv.org/abs/2510.21209","url_pdf":"https://arxiv.org/pdf/2510.21209v1","authors":"[\"Zixiang Wan\",\"Guochang Zhang\",\"Yifeng He\",\"Jianqiang Wei\"]","published":"2025-10-24T07:25:13Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.SD\"]","methods":"[\"Language Model\",\"Convolutional Neural Network\"]","has_code":false}