{"ID":2827935,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.22166","arxiv_id":"2512.22166","title":"AudioGAN: A Compact and Efficient Framework for Real-Time High-Fidelity Text-to-Audio Generation","abstract":"Text-to-audio (TTA) generation can significantly benefit the media industry by reducing production costs and enhancing work efficiency. However, most current TTA models (primarily diffusion-based) suffer from slow inference speeds and high computational costs. In this paper, we introduce AudioGAN, the first successful Generative Adversarial Networks (GANs)-based TTA framework that generates audio in a single pass, thereby reducing model complexity and inference time. To overcome the inherent difficulties in training GANs, we integrate multiple ,contrastive losses and propose innovative components Single-Double-Triple (SDT) Attention and Time-Frequency Cross-Attention (TF-CA). Extensive experiments on the AudioCaps dataset demonstrate that AudioGAN achieves state-of-the-art performance while using 90% fewer parameters and running 20 times faster, synthesizing audio in under one second. These results establish AudioGAN as a practical and powerful solution for real-time TTA.","short_abstract":"Text-to-audio (TTA) generation can significantly benefit the media industry by reducing production costs and enhancing work efficiency. However, most current TTA models (primarily diffusion-based) suffer from slow inference speeds and high computational costs. In this paper, we introduce AudioGAN, the first successful...","url_abs":"https://arxiv.org/abs/2512.22166","url_pdf":"https://arxiv.org/pdf/2512.22166v1","authors":"[\"HaeChun Chung\"]","published":"2025-12-17T09:13:23Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[\"Diffusion Model\",\"Generative Adversarial Network\"]","has_code":false}