{"ID":3006173,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-04T19:14:31.964469513Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03027","arxiv_id":"2606.03027","title":"SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia","abstract":"Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages. We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.","short_abstract":"Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages. We p...","url_abs":"https://arxiv.org/abs/2606.03027","url_pdf":"https://arxiv.org/pdf/2606.03027v1","authors":"[\"Peerat Limkonchotiwat\",\"Raymond Ng\",\"Sarana Nutanong\",\"Jian Gang Ngui\"]","published":"2026-06-02T02:05:14Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
