{"ID":3084821,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T03:38:11.424509713Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05661","arxiv_id":"2606.05661","title":"Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments","abstract":"Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.","short_abstract":"Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems...","url_abs":"https://arxiv.org/abs/2606.05661","url_pdf":"https://arxiv.org/pdf/2606.05661v1","authors":"[\"Parth Asawa\",\"Christopher M. Glaze\",\"Gabriel Orlanski\",\"Ramya Ramakrishnan\",\"Benji Xu\",\"Asim Biswal\",\"Vincent Sunn Chen\",\"Frederic Sala\",\"Matei Zaharia\",\"Joseph E. Gonzalez\"]","published":"2026-06-04T03:43:28Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}