{"ID":2882622,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09486","arxiv_id":"2508.09486","title":"Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding","abstract":"Video Large Language Models (Video-LLMs) have shown strong video understanding, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress long videos into a handful of representative frames via retrieval or summarization. However, most existing pipelines score frames in isolation, implicitly assuming that frame-level saliency is sufficient for downstream reasoning. This often yields redundant selections, fragmented temporal evidence, and weakened narrative grounding for long-form video question answering. We present \\textbf{Video-EM}, a training-free, event-centric episodic memory framework that reframes long-form VideoQA as \\emph{episodic event construction} followed by \\emph{memory refinement}. Instead of treating retrieved keyframes as independent visuals, Video-EM employs an LLM as an active memory agent to orchestrate off-the-shelf tools: it first localizes query-relevant moments via multi-grained semantic matching, then groups and segments them into temporally coherent events, and finally encodes each event as a grounded episodic memory with explicit temporal indices and spatio-temporal cues (capturing \\emph{when}, \\emph{where}, \\emph{what}, and involved entities). To further suppress verbosity and noise from imperfect upstream signals, Video-EM integrates a reasoning-driven self-reflection loop that iteratively verifies evidence sufficiency and cross-event consistency, removes redundancy, and adaptively adjusts event granularity. The outcome is a compact yet reliable \\emph{event timeline} -- a minimal but sufficient episodic memory set that can be directly consumed by existing Video-LLMs without additional training or architectural changes.","short_abstract":"Video Large Language Models (Video-LLMs) have shown strong video understanding, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress long videos into a handful of representative frames via retrieval or summarization. However, most existing pipeline...","url_abs":"https://arxiv.org/abs/2508.09486","url_pdf":"https://arxiv.org/pdf/2508.09486v2","authors":"[\"Yun Wang\",\"Long Zhang\",\"Jingren Liu\",\"Jiaqi Yan\",\"Zhanjie Zhang\",\"Jiahao Zheng\",\"Ao Ma\",\"Run Ling\",\"Xun Yang\",\"Dapeng Wu\",\"Xiangyu Chen\",\"Xuelong Li\"]","published":"2025-08-13T04:33:07Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.MM\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
