{"ID":2887836,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.00955","arxiv_id":"2508.00955","title":"From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model","abstract":"Adapting generative Multimodal Large Language Models (MLLMs) into universal embedding models typically demands resource-intensive contrastive pre-training, while traditional hard negative mining methods suffer from severe false negative contamination. In this paper, we propose a highly data-efficient framework that bypasses extensive pre-training to build a robust multimodal representation space. We first introduce a hierarchical embedding prompt that provides strong latent conditioning. By explicitly anchoring task definitions at the system level, this prompting strategy effectively bridges the modality gap and unlocks powerful zero-shot embedding capabilities. Building upon this latent conditioning, we present Self-aware Hard Negative Sampling (SaHa). Unlike conventional candidate-space mining, SaHa shifts the mechanism to the query-space by mapping retrieved candidates back to their owner queries to rigorously filter out semantic false negatives. Furthermore, our method constructs mutually hard clusters, maximizing intra-task discrimination and batch efficiency without redundant forward passes. Extensive experiments demonstrate that our unified approach achieves highly competitive fine-tuning performance on the Massive Multimodal Embedding Benchmark using only a fraction of standard training data.","short_abstract":"Adapting generative Multimodal Large Language Models (MLLMs) into universal embedding models typically demands resource-intensive contrastive pre-training, while traditional hard negative mining methods suffer from severe false negative contamination. In this paper, we propose a highly data-efficient framework that byp...","url_abs":"https://arxiv.org/abs/2508.00955","url_pdf":"https://arxiv.org/pdf/2508.00955v2","authors":"[\"Yeong-Joon Ju\",\"Seong-Whan Lee\"]","published":"2025-08-01T07:31:24Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.IR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}