{"ID":2835090,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.03086","arxiv_id":"2512.03086","title":"Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation","abstract":"Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran -\u003e C++ and C++ -\u003e CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.","short_abstract":"Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual...","url_abs":"https://arxiv.org/abs/2512.03086","url_pdf":"https://arxiv.org/pdf/2512.03086v1","authors":"[\"Le Chen\",\"Nuo Xu\",\"Winson Chen\",\"Bin Lei\",\"Pei-Hung Lin\",\"Dunzhi Zhou\",\"Rajeev Thakur\",\"Caiwen Ding\",\"Ali Jannesari\",\"Chunhua Liao\"]","published":"2025-11-29T05:26:53Z","proceeding":"cs.PL","tasks":"[\"cs.PL\",\"cs.AI\",\"cs.SE\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
