{"ID":2894307,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.11096","arxiv_id":"2507.11096","title":"EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing","abstract":"In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model's functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms, based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly-used music-specific evaluation metrics and a human study, to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt-to-prompt guidance with autoregressive generation models significantly outperforms the diffusion-based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at https://github.com/billsioros/EditGen","short_abstract":"In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms. Integrating a diffusion-based strategy, influenc...","url_abs":"https://arxiv.org/abs/2507.11096","url_pdf":"https://arxiv.org/pdf/2507.11096v1","authors":"[\"Vassilis Sioros\",\"Alexandros Potamianos\",\"Giorgos Paraskevopoulos\"]","published":"2025-07-15T08:44:11Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"eess.AS\"]","methods":"[\"Diffusion Model\"]","has_code":false,"code_links":[{"ID":612106,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2894307,"paper_url":"https://arxiv.org/abs/2507.11096","paper_title":"EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing","repo_url":"https://github.com/billsioros/EditGen","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}