{"ID":2862802,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.26276","arxiv_id":"2509.26276","title":"Optimizing Speech Language Models for Acoustic Consistency","abstract":"We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and content planning. We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Acoustic studies show that the speech-only models achieve the highest consistency across speaker, gender, sentiment, room, and background factors, surpassing larger systems. Interleaving improves lexical and syntactic probes and semantic-acoustic alignment but reduces consistency. Linear probes show that our initialization biases the model toward content structure while trading off prosody detail. These results show that LM-side design and training mix control the balance between acoustic stability and semantic grounding without changes to the tokenizer or runtime architecture. A demo and model weights are available for exploration.","short_abstract":"We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and con...","url_abs":"https://arxiv.org/abs/2509.26276","url_pdf":"https://arxiv.org/pdf/2509.26276v1","authors":"[\"Morteza Rohanian\",\"Michael Krauthammer\"]","published":"2025-09-30T13:59:52Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.SD\"]","methods":"[\"Language Model\",\"LoRA\"]","has_code":false}