{"ID":2866609,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.20149","arxiv_id":"2509.20149","title":"Synergistic Enhancement of Requirement-to-Code Traceability: A Framework Combining Large Language Model based Data Augmentation and an Advanced Encoder","abstract":"Automated requirement-to-code traceability link recovery, essential for industrial system quality and safety, is critically hindered by the scarcity of labeled data. To address this bottleneck, this paper proposes and validates a synergistic framework that integrates large language model (LLM)-driven data augmentation with an advanced encoder. We first demonstrate that data augmentation, optimized through a systematic evaluation of bi-directional and zero/few-shot prompting strategies, is highly effective, while the choice among leading LLMs is not a significant performance factor. Building on the augmented data, we further enhance an established, state-of-the-art pre-trained language model based method by incorporating an encoder distinguished by a broader pre-training corpus and an extended context window. Our experiments on four public datasets quantify the distinct contributions of our framework's components: on its own, data augmentation consistently improves the baseline method, providing substantial performance gains of up to 26.66%; incorporating the advanced encoder provides an additional lift of 2.21% to 11.25%. This synergy culminates in a fully optimized framework with maximum gains of up to 28.59% on $F_1$ score and 28.9% on $F_2$ score over the established baseline, decisively outperforming ten established baselines from three dominant paradigms. This work contributes a pragmatic and scalable methodology to overcome the data scarcity bottleneck, paving the way for broader industrial adoption of data-driven requirement-to-code traceability.","short_abstract":"Automated requirement-to-code traceability link recovery, essential for industrial system quality and safety, is critically hindered by the scarcity of labeled data. To address this bottleneck, this paper proposes and validates a synergistic framework that integrates large language model (LLM)-driven data augmentation...","url_abs":"https://arxiv.org/abs/2509.20149","url_pdf":"https://arxiv.org/pdf/2509.20149v2","authors":"[\"Jianzhang Zhang\",\"Jialong Zhou\",\"Nan Niu\",\"Jinping Hua\",\"Chuang Liu\"]","published":"2025-09-24T14:14:21Z","proceeding":"cs.SE","tasks":"[\"cs.SE\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
