{"ID":2845869,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.03601","arxiv_id":"2511.03601","title":"Step-Audio-EditX Technical Report","abstract":"We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.","short_abstract":"We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, wh...","url_abs":"https://arxiv.org/abs/2511.03601","url_pdf":"https://arxiv.org/pdf/2511.03601v2","authors":"[\"Chao Yan\",\"Boyong Wu\",\"Peng Yang\",\"Pengfei Tan\",\"Guoqiang Hu\",\"Li Xie\",\"Yuxin Zhang\",\"Xiangyu Zhang\",\"Fei Tian\",\"Xuerui Yang\",\"Xiangyu Zhang\",\"Daxin Jiang\",\"Shuchang Zhou\",\"Gang Yu\"]","published":"2025-11-05T16:22:19Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.HC\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Large Language Model\"]","has_code":false}