{"ID":2872215,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.09459","arxiv_id":"2509.09459","title":"Boosting Data Utilization for Multilingual Dense Retrieval","abstract":"Multilingual dense retrieval aims to retrieve relevant documents across different languages based on a unified retriever model. The challenge lies in aligning representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness highly relies on the quality of the negative sample and the efficacy of mini-batch data. Different from the existing studies that focus on developing sophisticated model architecture, we propose a method to boost data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. The extensive experimental results on a multilingual retrieval benchmark, MIRACL, with 16 languages demonstrate the effectiveness of our method by outperforming several existing strong baselines.","short_abstract":"Multilingual dense retrieval aims to retrieve relevant documents across different languages based on a unified retriever model. The challenge lies in aligning representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiv...","url_abs":"https://arxiv.org/abs/2509.09459","url_pdf":"https://arxiv.org/pdf/2509.09459v1","authors":"[\"Chao Huang\",\"Fengran Mo\",\"Yufeng Chen\",\"Changhao Guan\",\"Zhenrui Yue\",\"Xinyu Wang\",\"Jinan Xu\",\"Kaiyu Huang\"]","published":"2025-09-11T13:42:50Z","proceeding":"cs.IR","tasks":"[\"cs.IR\"]","methods":"[]","has_code":false}