{"ID":2843700,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.06653","arxiv_id":"2511.06653","title":"HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment","abstract":"Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content. To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities, and a monotonicity-aware contrastive loss (MoLo) that jointly aligns global and component-level representations, encouraging the model to internalize semantic ordering and alignment strength as a function of textual completeness. These components work in concert to produce structured, cognitively-aligned cross-modal representations. Experiments on multiple image-text retrieval benchmarks show that HiMo-CLIP consistently outperforms strong baselines, particularly under long or compositional descriptions. 
The code is available at https://github.com/UnicomAI/HiMo-CLIP.","short_abstract":"Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In...","url_abs":"https://arxiv.org/abs/2511.06653","url_pdf":"https://arxiv.org/pdf/2511.06653v1","authors":"[\"Ruijia Wu\",\"Ping Chen\",\"Fei Shen\",\"Shaoan Zhao\",\"Qiang Hui\",\"Huanlin Gao\",\"Ting Lu\",\"Zhaoxiang Liu\",\"Fang Zhao\",\"Kai Wang\",\"Shiguo Lian\"]","published":"2025-11-10T03:04:36Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":607245,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2843700,"paper_url":"https://arxiv.org/abs/2511.06653","paper_title":"HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment","repo_url":"https://github.com/UnicomAI/HiMo-CLIP","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
