{"ID":2891064,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18762","arxiv_id":"2507.18762","title":"The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages","abstract":"In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.","short_abstract":"In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu...","url_abs":"https://arxiv.org/abs/2507.18762","url_pdf":"https://arxiv.org/pdf/2507.18762v1","authors":"[\"Abdulhady Abas Abdullah\",\"Amir H. Gandomi\",\"Tarik A Rashid\",\"Seyedali Mirjalili\",\"Laith Abualigah\",\"Milena Živković\",\"Hadi Veisi\"]","published":"2025-07-24T19:28:33Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
