{"ID":2847488,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.27246","arxiv_id":"2510.27246","title":"Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs","abstract":"Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.","short_abstract":"Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This p...","url_abs":"https://arxiv.org/abs/2510.27246","url_pdf":"https://arxiv.org/pdf/2510.27246v2","authors":"[\"Mohammad Tavakoli\",\"Alireza Salemi\",\"Carrie Ye\",\"Mohamed Abdalla\",\"Hamed Zamani\",\"J Ross Mitchell\"]","published":"2025-10-31T07:29:52Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.IR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
