{"ID":2837986,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.18411","arxiv_id":"2511.18411","title":"SmolKalam: Ensemble Quality-Filtered Translation at Scale for High Quality Arabic Post-Training Data","abstract":"Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at the pretraining scale, but post-training demands much higher quality, which requires a stricter approach to dataset curation. In this work, we introduce SmolKalam, a translation of Smoltalk2 that uses a multi-model ensemble translation pipeline, applies quality filtering, and examines effective translation techniques for traditional decoder-only models through ablations.","short_abstract":"Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at the pretraining scale, but post-training demands much higher quality, which requires a stricter approac...","url_abs":"https://arxiv.org/abs/2511.18411","url_pdf":"https://arxiv.org/pdf/2511.18411v1","authors":"[\"Sultan Alrashed\",\"Chadi Helwe\",\"Francesco Orabona\"]","published":"2025-11-23T11:53:30Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[]","has_code":false}
