{"ID":2895662,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.08480","arxiv_id":"2507.08480","title":"Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging","abstract":"With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. To systematically investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Our work demonstrates that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities. Our findings underscore the effects of linguistic configuration of training data on both CLIR and Mono-Lingual IR, and present Model Merging as a viable strategy to optimize performance across these tasks.","short_abstract":"With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. \nTo systematically investigate thi...","url_abs":"https://arxiv.org/abs/2507.08480","url_pdf":"https://arxiv.org/pdf/2507.08480v2","authors":"[\"Youngjoon Jang\",\"Junyoung Son\",\"Taemin Lee\",\"Seongtae Hong\",\"Hyeonseok Moon\",\"Seungyoon Lee\",\"Andrew Matteson\",\"Heuiseok Lim\"]","published":"2025-07-11T10:44:09Z","proceeding":"cs.IR","tasks":"[\"cs.IR\"]","methods":"[]","has_code":false}
