{"ID":2868386,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.16603","arxiv_id":"2509.16603","title":"An Octave-based Multi-Resolution CQT Architecture for Diffusion-based Audio Generation","abstract":"This paper introduces MR-CQTdiff, a novel neural-network architecture for diffusion-based audio generation that leverages a multi-resolution Constant-$Q$ Transform (C$Q$T). The proposed architecture employs an efficient, invertible CQT framework that adjusts the time-frequency resolution on an octave-by-octave basis. This design addresses the issue of low temporal resolution at lower frequencies, enabling more flexible and expressive audio generation. We conduct an evaluation using the Fréchet Audio Distance (FAD) metric across various architectures and two datasets. Experimental results demonstrate that MR-CQTdiff achieves state-of-the-art audio quality, outperforming competing architectures.","short_abstract":"This paper introduces MR-CQTdiff, a novel neural-network architecture for diffusion-based audio generation that leverages a multi-resolution Constant-$Q$ Transform (C$Q$T). The proposed architecture employs an efficient, invertible CQT framework that adjusts the time-frequency resolution on an octave-by-octave basis. T...","url_abs":"https://arxiv.org/abs/2509.16603","url_pdf":"https://arxiv.org/pdf/2509.16603v1","authors":"[\"Maurício do V. M. da Costa\",\"Eloi Moliner\"]","published":"2025-09-20T09:57:37Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.SD\"]","methods":"[\"Diffusion Model\"]","has_code":false}