{"ID":2872286,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.09614","arxiv_id":"2509.09614","title":"LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering","abstract":"The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.","short_abstract":"The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, com...","url_abs":"https://arxiv.org/abs/2509.09614","url_pdf":"https://arxiv.org/pdf/2509.09614v1","authors":"[\"Jielin Qiu\",\"Zuxin Liu\",\"Zhiwei Liu\",\"Rithesh Murthy\",\"Jianguo Zhang\",\"Haolin Chen\",\"Shiyu Wang\",\"Ming Zhu\",\"Liangwei Yang\",\"Juntao Tan\",\"Zhepeng Cen\",\"Cheng Qian\",\"Shelby Heinecke\",\"Weiran Yao\",\"Silvio Savarese\",\"Caiming Xiong\",\"Huan Wang\"]","published":"2025-09-11T16:55:04Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609959,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2872286,"paper_url":"https://arxiv.org/abs/2509.09614","paper_title":"LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering","repo_url":"https://github.com/SalesforceAIResearch/LoCoBench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
