{"ID":3083612,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T03:54:17.966829144Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.06357","arxiv_id":"2606.06357","title":"F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation","abstract":"Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.","short_abstract":"Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adap...","url_abs":"https://arxiv.org/abs/2606.06357","url_pdf":"https://arxiv.org/pdf/2606.06357v1","authors":"[\"Dinghao Zhou\",\"Xingchen Song\",\"Di Wu\",\"Pengyu Cheng\",\"Shengfan Shen\",\"Sixiang Lv\"]","published":"2026-06-04T16:25:07Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"eess.AS\"]","methods":"[\"Large Language Model\"]","has_code":false}