{"ID":2895213,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.09601","arxiv_id":"2507.09601","title":"NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance","abstract":"General-purpose sentence embedding models often struggle to capture specialized financial semantics, especially in low-resource languages like Korean, due to domain-specific jargon, temporal meaning shifts, and misaligned bilingual vocabularies. To address these gaps, we introduce NMIXX (Neural eMbeddings for Cross-lingual eXploration of Finance), a suite of cross-lingual embedding models fine-tuned with 18.8K high-confidence triplets that pair in-domain paraphrases, hard negatives derived from a semantic-shift typology, and exact Korean-English translations. Concurrently, we release KorFinSTS, a 1,921-pair Korean financial STS benchmark spanning news, disclosures, research reports, and regulations, designed to expose nuances that general benchmarks miss. When evaluated against seven open-license baselines, NMIXX's multilingual bge-m3 variant achieves Spearman's rho gains of +0.10 on English FinSTS and +0.22 on KorFinSTS, outperforming its pre-adaptation checkpoint and surpassing other models by the largest margin, while revealing a modest trade-off in general STS performance. Our analysis further shows that models with richer Korean token coverage adapt more effectively, underscoring the importance of tokenizer design in low-resource, cross-lingual settings. By making both models and the benchmark publicly available, we provide the community with robust tools for domain-adapted, multilingual representation learning in finance.","short_abstract":"General-purpose sentence embedding models often struggle to capture specialized financial semantics, especially in low-resource languages like Korean, due to domain-specific jargon, temporal meaning shifts, and misaligned bilingual vocabularies. To address these gaps, we introduce NMIXX (Neural eMbeddings for Cross-lin...","url_abs":"https://arxiv.org/abs/2507.09601","url_pdf":"https://arxiv.org/pdf/2507.09601v2","authors":"[\"Hanwool Lee\",\"Sara Yu\",\"Yewon Hwang\",\"Jonghyun Choi\",\"Heejae Ahn\",\"Sungbum Jung\",\"Youngjae Yu\"]","published":"2025-07-13T12:14:57Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"q-fin.CP\"]","methods":"[\"LoRA\"]","has_code":false}
