{"ID":2848415,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.25804","arxiv_id":"2510.25804","title":"Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data","abstract":"Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.","short_abstract":"Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using...","url_abs":"https://arxiv.org/abs/2510.25804","url_pdf":"https://arxiv.org/pdf/2510.25804v1","authors":"[\"Haoran Deng\",\"Yingyu Lin\",\"Zhenghao Lin\",\"Xiao Liu\",\"Yizhou Sun\",\"Yi-An Ma\",\"Yeyun Gong\"]","published":"2025-10-29T06:21:08Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
