{"ID":2852856,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.17685","arxiv_id":"2510.17685","title":"Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning","abstract":"Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing challenge in modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are presented in https://github.com/Flame-Chasers/Bi-IRRA.","short_abstract":"Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing challenge in modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal di...","url_abs":"https://arxiv.org/abs/2510.17685","url_pdf":"https://arxiv.org/pdf/2510.17685v1","authors":"[\"Min Cao\",\"Xinyu Zhou\",\"Ding Jiang\",\"Bo Du\",\"Mang Ye\",\"Min Zhang\"]","published":"2025-10-20T16:01:11Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":608032,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2852856,"paper_url":"https://arxiv.org/abs/2510.17685","paper_title":"Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning","repo_url":"https://github.com/Flame-Chasers/Bi-IRRA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}