{"ID":2839156,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.16528","arxiv_id":"2511.16528","title":"TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval","abstract":"Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71\\% of its average mAP. Late-interaction models that are 3--5$\\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8\\% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\\times$ faster than PLAID and offers +1.7\\% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.","short_abstract":"Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have no...","url_abs":"https://arxiv.org/abs/2511.16528","url_pdf":"https://arxiv.org/pdf/2511.16528v1","authors":"[\"Özay Ezerceli\",\"Mahmoud El Hussieni\",\"Selva Taş\",\"Reyhan Bayraktar\",\"Fatma Betül Terzioğlu\",\"Yusuf Çelebi\",\"Yağız Asker\"]","published":"2025-11-20T16:42:21Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.IR\"]","methods":"[]","has_code":false}
