{"ID":2882030,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.11598","arxiv_id":"2508.11598","title":"Representing Speech Through Autoregressive Prediction of Cochlear Tokens","abstract":"We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete \\textbf{cochlear tokens}. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, and state-of-the-art lexical semantics. AuriStream shows competitive performance on diverse downstream SUPERB speech tasks. Complementing AuriStream's strong representational capabilities, it generates continuations of audio which can be visualized in a spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.","short_abstract":"We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete \\textbf{cochlear tokens}. The sec...","url_abs":"https://arxiv.org/abs/2508.11598","url_pdf":"https://arxiv.org/pdf/2508.11598v1","authors":"[\"Greta Tuckute\",\"Klemen Kotar\",\"Evelina Fedorenko\",\"Daniel L. K. Yamins\"]","published":"2025-08-15T17:06:04Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}