{"ID":2870130,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.12591","arxiv_id":"2509.12591","title":"MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models","abstract":"Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose the zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. The performance is heavily influenced by the audio-text matching model and keyword selection, with optimal results achieved using a single keyword prompt, and a 50% performance drop when no keyword list is used.","short_abstract":"Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose the zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP...","url_abs":"https://arxiv.org/abs/2509.12591","url_pdf":"https://arxiv.org/pdf/2509.12591v1","authors":"[\"Vijay Govindarajan\",\"Pratik Patel\",\"Sahil Tripathi\",\"Md Azizul Hoque\",\"Gautam Siddharth Kashyap\"]","published":"2025-09-16T02:36:00Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
