{"ID":2863704,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24924","arxiv_id":"2509.24924","title":"SAGA-SR: Semantically and Acoustically Guided Audio Super-Resolution","abstract":"Versatile audio super-resolution (SR) aims to predict high-frequency components from low-resolution audio across diverse domains such as speech, music, and sound effects. Existing diffusion-based SR methods often fail to produce semantically aligned outputs and struggle with consistent high-frequency reconstruction. In this paper, we propose SAGA-SR, a versatile audio SR model that combines semantic and acoustic guidance. Based on a DiT backbone trained with a flow matching objective, SAGA-SR is conditioned on text and spectral roll-off embeddings. Due to the effective guidance provided by its conditioning, SAGA-SR robustly upsamples audio from arbitrary input sampling rates between 4 kHz and 32 kHz to 44.1 kHz. Both objective and subjective evaluations show that SAGA-SR achieves state-of-the-art performance across all test cases. Sound examples and code for the proposed model are available online.","short_abstract":"Versatile audio super-resolution (SR) aims to predict high-frequency components from low-resolution audio across diverse domains such as speech, music, and sound effects. Existing diffusion-based SR methods often fail to produce semantically aligned outputs and struggle with consistent high-frequency reconstruction. In...","url_abs":"https://arxiv.org/abs/2509.24924","url_pdf":"https://arxiv.org/pdf/2509.24924v1","authors":"[\"Jaekwon Im\",\"Juhan Nam\"]","published":"2025-09-29T15:25:48Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Diffusion Model\"]","has_code":false}
