{"ID":2834509,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.14263","arxiv_id":"2601.14263","title":"Call2Instruct: Automated Pipeline for Generating Q\u0026A Datasets from Call Center Recordings for LLM Fine-Tuning","abstract":"The adaptation of Large-Scale Language Models (LLMs) to specific domains depends on high-quality fine-tuning datasets, particularly in instructional format (e.g., Question-Answer - Q\u0026A). However, generating these datasets, particularly from unstructured sources such as call center audio recordings, poses a significant challenge due to the noisy and disorganized nature of the data. This paper presents a solution to this challenge by offering an end-to-end automated pipeline for generating Q\u0026A instructional datasets from such recordings. The methodology developed comprises sequential steps of audio processing (including diarization, noise removal and automatic transcription), textual processing (cleaning, normalization, and anonymization), semantic extraction of customer demands and attendant responses using vector embeddings, and matching via semantic search to form the final Q\u0026A pairs. As a result, the complete pipeline was successfully implemented, generating a dataset specifically formatted for Instruct Fine Tuning. The practical value and feasibility of the generated dataset were substantiated and functionally demonstrated through the successful fine-tuning of an LLM model (based on Llama 2 7B). The conclusion of the paper states that the proposed approach is viable for converting unstructured conversational data from call centers into valuable resources for training LLMs. This development has the potential to open up avenues for creating more effective AI systems for Q\u0026A tasks in the customer service domain. The code has been made publicly available to promote reproducibility and future research.","short_abstract":"The adaptation of Large-Scale Language Models (LLMs) to specific domains depends on high-quality fine-tuning datasets, particularly in instructional format (e.g., Question-Answer - Q\u0026A). However, generating these datasets, particularly from unstructured sources such as call center audio recordings, poses a significant...","url_abs":"https://arxiv.org/abs/2601.14263","url_pdf":"https://arxiv.org/pdf/2601.14263v1","authors":"[\"Alex Echeverria\",\"Sávio Salvarino Teles de Oliveira\",\"Fernando Marques Federson\"]","published":"2025-12-01T13:39:54Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\",\"cs.HC\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\",\"Generative Adversarial Network\"]","has_code":false}