{"ID":2868579,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15568","arxiv_id":"2509.15568","title":"LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs","abstract":"High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face challenges of computational efficiency. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language training.","short_abstract":"High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face challenges of computational efficiency. We present LiteLong, a resource-efficient method for synthesizing long-context...","url_abs":"https://arxiv.org/abs/2509.15568","url_pdf":"https://arxiv.org/pdf/2509.15568v1","authors":"[\"Junlong Jia\",\"Xing Wu\",\"Chaochen Gao\",\"Ziyang Chen\",\"Zijia Lin\",\"Zhongzhi Li\",\"Weinong Wang\",\"Haotian Xu\",\"Donghui Jin\",\"Debing Zhang\",\"Binghui Guo\"]","published":"2025-09-19T04:07:46Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\",\"Generative Adversarial Network\"]","has_code":false}
