{"ID":3084860,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T05:16:48.22291569Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05717","arxiv_id":"2606.05717","title":"Enhancing Audio Captioning with Auxiliary AudioSet Semantics","abstract":"Automatic Audio Captioning (AAC) seeks to generate natural language descriptions of complex acoustic scenes, bridging auditory perception and language understanding. However, word-selection indeterminacy and increasing reliance on large-scale sequence-to-sequence or LLM-based models limit practical deployment. We propose a resource-efficient AAC framework that explicitly grounds caption generation in auxiliary AudioSet semantics. Frame-level acoustic representations extracted using a ConvNeXt encoder are augmented with top-$K$ predicted AudioSet keywords, providing structured contextual cues for decoding. A compact six-layer BART-style decoder conditions on this joint acoustic-semantic representation, enabling caption generation without LLM-scale decoding. The proposed design balances semantic grounding and computational efficiency within a compact architecture. Evaluations on Clotho V2 and AudioCaps confirm competitive caption quality under practical deployment constraints.","short_abstract":"Automatic Audio Captioning (AAC) seeks to generate natural language descriptions of complex acoustic scenes, bridging auditory perception and language understanding. However, word-selection indeterminacy and increasing reliance on large-scale sequence-to-sequence or LLM-based models limit practical deployment. We propo...","url_abs":"https://arxiv.org/abs/2606.05717","url_pdf":"https://arxiv.org/pdf/2606.05717v1","authors":"[\"Shubham Gupta\",\"Adarsh Arigala\",\"Sri Rama Murty Kodukula\"]","published":"2026-06-04T05:18:01Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Large Language Model\"]","has_code":false}