{"ID":2856028,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10889","arxiv_id":"2510.10889","title":"Topological Alignment of Shared Vision-Language Embedding Space","abstract":"Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework aligning embedding spaces with topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagram with theoretical error bounds using graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on the CIFAR-100, and stronger multilingual retrieval performance on the xFlickr\u0026CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning. Code is available at https://github.com/junwon0/ToMCLIP.git.","short_abstract":"Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the glo...","url_abs":"https://arxiv.org/abs/2510.10889","url_pdf":"https://arxiv.org/pdf/2510.10889v2","authors":"[\"Junwon You\",\"Dasol Kang\",\"Jae-Hun Jung\"]","published":"2025-10-13T01:36:38Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":608304,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2856028,"paper_url":"https://arxiv.org/abs/2510.10889","paper_title":"Topological Alignment of Shared Vision-Language Embedding Space","repo_url":"https://github.com/junwon0/ToMCLIP.git","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}