{"ID":2896931,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.05750","arxiv_id":"2507.05750","title":"DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities","abstract":"Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at https://huggingface.co/datasets/AmazonScience/DocTalk.","short_abstract":"Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing...","url_abs":"https://arxiv.org/abs/2507.05750","url_pdf":"https://arxiv.org/pdf/2507.05750v1","authors":"[\"Jing Yang Lee\",\"Hamed Bonab\",\"Nasser Zalmout\",\"Ming Zeng\",\"Sanket Lokegaonkar\",\"Colin Lockard\",\"Binxuan Huang\",\"Ritesh Sarkhel\",\"Haodong Wang\"]","published":"2025-07-08T07:52:12Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
