{"ID":2921125,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-04T06:21:04.369492701Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01802","arxiv_id":"2606.01802","title":"MOSS-Audio Technical Report","abstract":"MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: \\textbf{DeepStack cross-layer feature injection}, which exposes the decoder to acoustic information from multiple encoder depths, and \\textbf{time markers}, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.","short_abstract":"MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the...","url_abs":"https://arxiv.org/abs/2606.01802","url_pdf":"https://arxiv.org/pdf/2606.01802v1","authors":"[\"Chen Yang\",\"Chufan Yu\",\"Hanfu Chen\",\"Jie Zhu\",\"Jingqi Chen\",\"Ke Chen\",\"Wenxuan Wang\",\"Yang Wang\",\"Yaozhou Jiang\",\"Yi Jiang\",\"Zhengyuan Lin\",\"Ziqi Chen\",\"Zhaoye Fei\",\"Chenghao Liu\",\"Jun Zhan\",\"Kang Yu\",\"Kexin Huang\",\"Mingshu Chen\",\"Qinyuan Cheng\",\"Ruixiao Li\",\"Shimin Li\",\"Songlin Wang\",\"Yang Gao\",\"Yiyang Zhang\",\"Xipeng Qiu\"]","published":"2026-06-01T07:19:22Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false}