{"ID":2831277,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.08777","arxiv_id":"2512.08777","title":"Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages","abstract":"We propose a post-training method for lower-resource languages that preserves the fluency of language models even when aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and instruction-tuned language models capable of generating fluent synthetic data. To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common alternatives: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.","short_abstract":"We propose a post-training method for lower-resource languages that preserves the fluency of language models even when aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both dat...","url_abs":"https://arxiv.org/abs/2512.08777","url_pdf":"https://arxiv.org/pdf/2512.08777v2","authors":"[\"David Samuel\",\"Lilja Øvrelid\",\"Erik Velldal\",\"Andrey Kutuzov\"]","published":"2025-12-09T16:31:48Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false}