{"ID":2898779,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.02666","arxiv_id":"2507.02666","title":"ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning","abstract":"In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.","short_abstract":"In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To addre...","url_abs":"https://arxiv.org/abs/2507.02666","url_pdf":"https://arxiv.org/pdf/2507.02666v1","authors":"[\"Junyu Wang\",\"Tianrui Wang\",\"Meng Ge\",\"Longbiao Wang\",\"Jianwu Dang\"]","published":"2025-07-03T14:29:43Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.CL\",\"eess.AS\"]","methods":"[\"Transformer\"]","has_code":false}