{"ID":2856719,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10509","arxiv_id":"2510.10509","title":"MARS-Sep: Multimodal-Aligned Reinforced Sound Separation","abstract":"Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. We introduce a preference alignment perspective, analogous to aligning LLMs with human intent. To address this, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is steered by a preference reward model and optimized by a stable, clipped trust-region surrogate. The reward, derived from a progressively-aligned audio-text-vision encoder, directly incentivizes semantic consistency with query prompts. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://github.com/mars-sep/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.","short_abstract":"Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. We introduce a preference alignment perspective, analogous to aligning LLMs...","url_abs":"https://arxiv.org/abs/2510.10509","url_pdf":"https://arxiv.org/pdf/2510.10509v2","authors":"[\"Zihan Zhang\",\"Xize Cheng\",\"Zhennan Jiang\",\"Dongjie Fu\",\"Jingyuan Chen\",\"Zhou Zhao\",\"Tao Jin\"]","published":"2025-10-12T09:05:28Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\"]","has_code":false,"code_links":[{"ID":608376,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2856719,"paper_url":"https://arxiv.org/abs/2510.10509","paper_title":"MARS-Sep: Multimodal-Aligned Reinforced Sound Separation","repo_url":"https://github.com/mars-sep/MARS-Sep","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}