{"ID":2851489,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.19217","arxiv_id":"2510.19217","title":"Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+","abstract":"Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. First, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data. Second, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. Across multiple zero-shot transfer benchmarks, we demonstrate that our representations significantly improve transfer performance when the distance type is relevant to the task, while our composite distance yields gains in most tasks.","short_abstract":"Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. First, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data. Second, they lack a princip...","url_abs":"https://arxiv.org/abs/2510.19217","url_pdf":"https://arxiv.org/pdf/2510.19217v3","authors":"[\"York Hay Ng\",\"Aditya Khan\",\"Xiang Lu\",\"Matteo Salloum\",\"Michael Zhou\",\"Phuong H. Hoang\",\"A. Seza Doğruöz\",\"En-Shiun Annie Lee\"]","published":"2025-10-22T03:59:19Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
