{"ID":3083920,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T09:16:17.280914754Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05917","arxiv_id":"2606.05917","title":"MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering","abstract":"Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.","short_abstract":"Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token co...","url_abs":"https://arxiv.org/abs/2606.05917","url_pdf":"https://arxiv.org/pdf/2606.05917v1","authors":"[\"Qing Yang\",\"Pengcheng Huang\",\"Xinze Li\",\"Zhenghao Liu\",\"Yukun Yan\",\"Yu Gu\",\"Ge Yu\",\"Gang Li\",\"Maosong Sun\"]","published":"2026-06-04T09:23:31Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[\"Language Model\",\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":612845,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-05T06:46:15.197025399Z","DeletedAt":null,"paper_id":3083920,"paper_url":"https://arxiv.org/abs/2606.05917","paper_title":"MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering","repo_url":"https://github.com/NEUIR/MemoryCard","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
