{"ID":2836377,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.21331","arxiv_id":"2511.21331","title":"The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment","abstract":"Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework. We release our code and dataset at https://github.com/estafons/confu.","short_abstract":"Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they of...","url_abs":"https://arxiv.org/abs/2511.21331","url_pdf":"https://arxiv.org/pdf/2511.21331v2","authors":"[\"Stefanos Koutoupis\",\"Michaela Areti Zervou\",\"Konstantinos Kontras\",\"Maarten De Vos\",\"Panagiotis Tsakalides\",\"Grigorios Tsagkatakis\"]","published":"2025-11-26T12:25:55Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[]","has_code":false,"code_links":[{"ID":606594,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2836377,"paper_url":"https://arxiv.org/abs/2511.21331","paper_title":"The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment","repo_url":"https://github.com/estafons/confu","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}