{"ID":2824244,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.23278","arxiv_id":"2512.23278","title":"Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation","abstract":"Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods like Flow Matching. GANs suffer from slow convergence during training, while diffusion methods require multi-step inference that introduces considerable computational overhead. In this work, we introduce Flow2GAN, a two-stage framework that combines Flow Matching training for learning generative capabilities with GAN fine-tuning for efficient few-step inference. Specifically, given audio's unique properties, we first improve Flow Matching for audio modeling through: 1) reformulating the objective as endpoint estimation, avoiding velocity estimation difficulties when involving empty regions; 2) applying spectral energy-based loss scaling to emphasize perceptually salient quieter regions. Building on these Flow Matching adaptations, we demonstrate that a further stage of lightweight GAN fine-tuning enables us to obtain few-step (e.g., 1/2/4 steps) generators that produce high-quality audio. In addition, we develop a multi-branch network architecture that processes Fourier coefficients at different time-frequency resolutions, which improves the modeling capabilities compared to prior single-resolution designs. Experimental results indicate that our Flow2GAN delivers high-fidelity audio generation from Mel-spectrograms or discrete audio tokens, achieving highly favorable quality-efficiency trade-offs compared to existing state-of-the-art GAN-based and Flow Matching-based methods. Online demo samples are available at https://flow2gan.github.io, and the source code is released at https://github.com/k2-fsa/Flow2GAN.","short_abstract":"Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods like Flow Matching. GANs suffer from slow convergence during training, while diffusion methods require multi-step inference that introduces considerable computational overhead. In this work, we intr...","url_abs":"https://arxiv.org/abs/2512.23278","url_pdf":"https://arxiv.org/pdf/2512.23278v2","authors":"[\"Zengwei Yao\",\"Wei Kang\",\"Han Zhu\",\"Liyong Guo\",\"Lingxuan Ye\",\"Fangjun Kuang\",\"Weiji Zhuang\",\"Zhaoqing Li\",\"Zhifeng Han\",\"Long Lin\",\"Daniel Povey\"]","published":"2025-12-29T08:01:59Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Diffusion Model\",\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":605563,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2824244,"paper_url":"https://arxiv.org/abs/2512.23278","paper_title":"Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation","repo_url":"https://github.com/k2-fsa/Flow2GAN","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
