{"ID":2882766,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09736","arxiv_id":"2508.09736","title":"Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory","abstract":"We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.","short_abstract":"We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, en...","url_abs":"https://arxiv.org/abs/2508.09736","url_pdf":"https://arxiv.org/pdf/2508.09736v4","authors":"[\"Lin Long\",\"Yichen He\",\"Wentao Ye\",\"Yiyuan Pan\",\"Yuan Lin\",\"Hang Li\",\"Junbo Zhao\",\"Wei Li\"]","published":"2025-08-13T12:03:03Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":610922,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2882766,"paper_url":"https://arxiv.org/abs/2508.09736","paper_title":"Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory","repo_url":"https://github.com/bytedance-seed/m3-agent","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
