{"ID":2878997,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.17497","arxiv_id":"2508.17497","title":"Multimodal Representation Learning Conditioned on Semantic Relations","abstract":"Multimodal representation learning has been largely driven by contrastive models such as CLIP, which learn a shared embedding space by aligning paired image-text samples. While effective for general-purpose representation learning, such models typically produce a single embedding per sample that is reused across different semantic relations and contexts. However, in many real-world applications, relevance between samples is inherently relation-dependent, with different semantic relations emphasizing different aspects of multimodal data. In this work, we propose Relation-Conditioned Multimodal Learning (RCML), a framework that treats semantic relations as explicit conditions of multimodal representation learning. Rather than producing relation-agnostic embeddings, RCML learns representations conditioned on natural-language relation descriptions, allowing the same sample to be represented differently under different relational contexts. The framework constructs relation-aware training pairs, introduces a relation-conditioned module to adapt embeddings to relation semantics, and employs a unified contrastive objective to jointly model cross-modal alignment and relation-induced inter-sample structure. Experiments on multiple datasets show that RCML consistently outperforms strong baselines on retrieval and classification tasks in zero-shot, fine-tuned, and out-of-domain settings, highlighting the effectiveness of leveraging semantic relations to guide multimodal representation learning.","short_abstract":"Multimodal representation learning has been largely driven by contrastive models such as CLIP, which learn a shared embedding space by aligning paired image-text samples. While effective for general-purpose representation learning, such models typically produce a single embedding per sample that is reused across differ...","url_abs":"https://arxiv.org/abs/2508.17497","url_pdf":"https://arxiv.org/pdf/2508.17497v2","authors":"[\"Yang Qiao\",\"Yuntong Hu\",\"Bowen Zhu\",\"Hasibul Haque\",\"Liang Zhao\"]","published":"2025-08-24T19:36:18Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[]","has_code":false}