{"ID":2865855,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.20891","arxiv_id":"2509.20891","title":"AIBA: Attention-based Instrument Band Alignment for Text-to-Audio Diffusion","abstract":"We present AIBA (Attention-In-Band Alignment), a lightweight, training-free pipeline to quantify where text-to-audio diffusion models attend on the time-frequency (T-F) plane. AIBA (i) hooks cross-attention at inference to record attention probabilities without modifying weights; (ii) projects them to fixed-size mel grids that are directly comparable to audio energy; and (iii) scores agreement with instrument-band ground truth via interpretable metrics (T-F IoU/AP, frequency-profile correlation, and a pointing game). On Slakh2100 with an AudioLDM2 backbone, AIBA reveals consistent instrument-dependent trends (e.g., bass favoring low bands) and achieves high precision with moderate recall.","short_abstract":"We present AIBA (Attention-In-Band Alignment), a lightweight, training-free pipeline to quantify where text-to-audio diffusion models attend on the time-frequency (T-F) plane. AIBA (i) hooks cross-attention at inference to record attention probabilities without modifying weights; (ii) projects them to fixed-size mel gr...","url_abs":"https://arxiv.org/abs/2509.20891","url_pdf":"https://arxiv.org/pdf/2509.20891v1","authors":"[\"Junyoung Koh\",\"Soo Yong Kim\",\"Gyu Hyeong Choi\",\"Yongwon Choi\"]","published":"2025-09-25T08:28:41Z","proceeding":"cs.SD","tasks":"[\"cs.SD\"]","methods":"[\"Diffusion Model\"]","has_code":false}