{"ID":2891060,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18750","arxiv_id":"2507.18750","title":"CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation","abstract":"We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To address both class-level and instance-level misalignment, we apply multi-modal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment.","short_abstract":"We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to...","url_abs":"https://arxiv.org/abs/2507.18750","url_pdf":"https://arxiv.org/pdf/2507.18750v1","authors":"[\"Hyunwoo Oh\",\"SeungJu Cha\",\"Kwanyoung Lee\",\"Si-Woo Kim\",\"Dong-Jin Kim\"]","published":"2025-07-24T19:01:05Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
