Enhancing Audio Captioning with Auxiliary AudioSet Semantics

eess.AS arXiv:2606.05717

Abstract

Automatic Audio Captioning (AAC) seeks to generate natural language descriptions of complex acoustic scenes, bridging auditory perception and language understanding. However, word-selection indeterminacy and increasing reliance on large-scale sequence-to-sequence or LLM-based models limit practical deployment. We propose a resource-efficient AAC framework that explicitly grounds caption generation in auxiliary AudioSet semantics. Frame-level acoustic representations extracted using a ConvNeXt encoder are augmented with top-$K$ predicted AudioSet keywords, providing structured contextual cues for decoding. A compact six-layer BART-style decoder conditions on this joint acoustic-semantic representation, enabling caption generation without LLM-scale decoding. The proposed design balances semantic grounding and computational efficiency within a compact architecture. Evaluations on Clotho V2 and AudioCaps confirm competitive caption quality under practical deployment constraints.
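The keyword-augmentation step described above can be illustrated with a minimal sketch. The abstract does not specify how the top-$K$ keywords are fused with the frame-level features, so this example assumes one plausible option: each selected keyword is mapped to an embedding and appended to the frame sequence the decoder attends to. The function names, the `keyword_table` lookup, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Sequence


def top_k_keywords(tag_probs: Dict[str, float], k: int) -> List[str]:
    """Return the k labels with the highest predicted probability."""
    ranked = sorted(tag_probs.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]


def build_decoder_memory(
    frame_feats: Sequence[Sequence[float]],
    keywords: Sequence[str],
    keyword_table: Dict[str, Sequence[float]],
) -> List[List[float]]:
    """Append one embedding per keyword to the frame-level sequence.

    Assumption: fusion is done by concatenation along the sequence axis,
    so frames and keywords must share the same feature dimension.
    """
    dim = len(frame_feats[0])
    keyword_feats = [list(keyword_table[word]) for word in keywords]
    for vec in keyword_feats:
        if len(vec) != dim:
            raise ValueError("keyword and frame dimensions must match")
    return [list(frame) for frame in frame_feats] + keyword_feats


# Hypothetical tagger output over a few AudioSet-style labels.
tag_probs = {"Rain": 0.81, "Thunder": 0.64, "Speech": 0.07, "Wind": 0.35}

# Hypothetical 4-dimensional embeddings for each label.
keyword_table = {
    "Rain": [0.1, 0.2, 0.3, 0.4],
    "Thunder": [0.5, 0.6, 0.7, 0.8],
    "Speech": [0.0, 0.0, 0.0, 0.0],
    "Wind": [0.9, 0.9, 0.9, 0.9],
}

# Three hypothetical encoder frames, each 4-dimensional.
frame_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

keywords = top_k_keywords(tag_probs, k=2)
memory = build_decoder_memory(frame_feats, keywords, keyword_table)
print(keywords)     # ['Rain', 'Thunder']
print(len(memory))  # 5: three frames followed by two keyword embeddings
```

In this arrangement the decoder's cross-attention would see the keywords as extra positions alongside the acoustic frames; other fusion schemes, such as feeding keywords as a text prefix, are equally compatible with the description above.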
