{"ID":2883104,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.08680","arxiv_id":"2508.08680","title":"TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation","abstract":"LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However, the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent form of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good-quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high-quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good-quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning.
Code and outputs are available at https://github.com/ArmelRandy/topxgen.","short_abstract":"LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised...","url_abs":"https://arxiv.org/abs/2508.08680","url_pdf":"https://arxiv.org/pdf/2508.08680v1","authors":"[\"Armel Zebaze\",\"Benoît Sagot\",\"Rachel Bawden\"]","published":"2025-08-12T06:58:02Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":610951,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2883104,"paper_url":"https://arxiv.org/abs/2508.08680","paper_title":"TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation","repo_url":"https://github.com/ArmelRandy/topxgen","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
