{"ID":2842130,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.10049","arxiv_id":"2511.10049","title":"Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents","abstract":"The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluation of AI agents typically involves using a fixed set of benchmarks and computing multiple evaluation metrics for the agent. While sufficient for simple coding tasks, these benchmarks fall short for enterprise-scale agents, where services and requirements evolve continuously and ground-truth examples are sparse. We propose a process of benchmark generation that helps evolve the benchmarks as the requirements change and perform robust evaluation of evolving AI agents. We instantiate this approach for a case study of service migration from one deployment platform to another at a large public enterprise. Our approach relies on semi-structured documents where developers express the high-level intent, and uses state-of-the-art LLMs to generate benchmarks from just a small number of such documents. Overall, this process results in a maintainable evaluation framework, enabling rapid feedback on agent performance and facilitating targeted improvements.","short_abstract":"The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluation of AI agents typically involves using a fixed set of benchmarks and computing multiple evaluation metrics for the agent. While sufficient for simple coding...","url_abs":"https://arxiv.org/abs/2511.10049","url_pdf":"https://arxiv.org/pdf/2511.10049v1","authors":"[\"Divyanshu Saxena\",\"Rishikesh Maurya\",\"Xiaoxuan Ou\",\"Gagan Somashekar\",\"Shachee Mishra Gupta\",\"Arun Iyer\",\"Yu Kang\",\"Chetan Bansal\",\"Aditya Akella\",\"Saravan Rajmohan\"]","published":"2025-11-13T07:48:22Z","proceeding":"cs.SE","tasks":"[\"cs.SE\"]","methods":"[\"Large Language Model\"]","has_code":false}
