{"ID":2867492,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.17359","arxiv_id":"2509.17359","title":"MLLM-Driven Semantic Identifier Generation for Generative Cross-Modal Retrieval","abstract":"Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or scalability.To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision Strategy, prompting the model to produce a one-sentence explanation alongside each identifier serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.","short_abstract":"Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typi...","url_abs":"https://arxiv.org/abs/2509.17359","url_pdf":"https://arxiv.org/pdf/2509.17359v2","authors":"[\"Tianyuan Li\",\"Lei Wang\",\"Ahtamjan Ahmat\",\"Yating Yang\",\"Bo Ma\",\"Rui Dong\",\"Bangju Han\"]","published":"2025-09-22T05:23:06Z","proceeding":"cs.IR","tasks":"[\"cs.IR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
