{"ID":2829765,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.11388","arxiv_id":"2512.11388","title":"Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis","abstract":"We investigated the impact of data selection on machine translation fine-tuning for open LLMs. Using Japanese-English corpora, we compare five selectors: TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection, under controlled training conditions. We observed that semantic selectors consistently outperform lexical and geometry-based heuristics, and that even when the selected data differ by less than 3%, the impact on model performance is substantial, underscoring the sensitivity of fine-tuning to data quality.","short_abstract":"We investigated the impact of data selection on machine translation fine-tuning for open LLMs. Using Japanese-English corpora, we compare five selectors: TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection, under controlled training conditions. We observed that semantic selectors consistently outperform lexical...","url_abs":"https://arxiv.org/abs/2512.11388","url_pdf":"https://arxiv.org/pdf/2512.11388v1","authors":"[\"Felipe Ribeiro Fujita de Mello\",\"Hideyuki Takada\"]","published":"2025-12-12T08:59:27Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
