{"ID":2824915,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.21842","arxiv_id":"2512.21842","title":"AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts","abstract":"High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce and existing datasets mainly consist of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising simple legal and complex literary parallel texts. Our evaluation demonstrates that \"Easy\" datasets lack the discriminatory power to fully assess alignment methods. By reducing one-to-one mappings in our \"Hard\" subset, we exposed the limitations of traditional alignment methods. In contrast, LLM-based approaches demonstrated better robustness, achieving an overall F1-score of 85.5%, a nearly 9% improvement over previous methods. Our datasets and codes are open-sourced at https://github.com/XXX.","short_abstract":"High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce and existing datasets mainly consist of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-Engl...","url_abs":"https://arxiv.org/abs/2512.21842","url_pdf":"https://arxiv.org/pdf/2512.21842v2","authors":"[\"Baorong Huang\",\"Ali Asiri\"]","published":"2025-12-26T03:10:43Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
