{"ID":2827301,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.18099","arxiv_id":"2512.18099","title":"SAM Audio: Segment Anything in Audio","abstract":"General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audios, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.","short_abstract":"General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. \nDespite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting on...","url_abs":"https://arxiv.org/abs/2512.18099","url_pdf":"https://arxiv.org/pdf/2512.18099v1","authors":"[\"Bowen Shi\",\"Andros Tjandra\",\"John Hoffman\",\"Helin Wang\",\"Yi-Chiao Wu\",\"Luya Gao\",\"Julius Richter\",\"Matt Le\",\"Apoorv Vyas\",\"Sanyuan Chen\",\"Christoph Feichtenhofer\",\"Piotr Dollár\",\"Wei-Ning Hsu\",\"Ann Lee\"]","published":"2025-12-19T22:14:23Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}