{"ID":2836385,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.21342","arxiv_id":"2511.21342","title":"Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures","abstract":"Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.","short_abstract":"Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabil...","url_abs":"https://arxiv.org/abs/2511.21342","url_pdf":"https://arxiv.org/pdf/2511.21342v1","authors":"[\"Genís Plaja-Roglans\",\"Yun-Ning Hung\",\"Xavier Serra\",\"Igor Pereira\"]","published":"2025-11-26T12:49:35Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\"]","methods":"[\"Diffusion Model\"]","has_code":false}
