{"ID":2867463,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.17317","arxiv_id":"2509.17317","title":"Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text","abstract":"Most languages lack sufficient data for large-scale monolingual pretraining, creating a \"data wall.\" Multilingual pretraining helps but is limited by language imbalance and the \"curse of multilinguality.\" An alternative is to translate high-resource text with machine translation (MT), which raises three questions: (1) How does MT-derived data scale with model capacity? (2) Can source-side transformations (e.g., simplifying English with an LLM) improve generalization to native text? (3) How well do models pretrained on MT-derived data adapt when continually trained on limited native text? We investigate these questions by translating English into Indonesian and Tamil--two typologically distant, lower-resource languages--and pretraining GPT-2 models (124M-774M) on native or MT-derived corpora from raw and LLM-simplified English. We evaluate cross-entropy loss on native text, along with accuracy on syntactic probes and downstream tasks. Our results show that (1) MT-pretrained models benefit from scaling; (2) source-side simplification harms generalization to native text; and (3) adapting MT-pretrained models on native text often yields better performance than native-only models, even with less native data. However, tasks requiring cultural nuance (e.g., toxicity detection) demand more exposure to native data.","short_abstract":"Most languages lack sufficient data for large-scale monolingual pretraining, creating a \"data wall.\" Multilingual pretraining helps but is limited by language imbalance and the \"curse of multilinguality.\" An alternative is to translate high-resource text with machine translation (MT), which raises three questions: (1)...","url_abs":"https://arxiv.org/abs/2509.17317","url_pdf":"https://arxiv.org/pdf/2509.17317v1","authors":"[\"Dan John Velasco\",\"Matthew Theodore Roque\"]","published":"2025-09-22T02:48:43Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
