{"ID":2848851,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.23981","arxiv_id":"2510.23981","title":"TeleEgo: Benchmarking Egocentric AI Assistants in the Wild","abstract":"Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work \u0026 study, lifestyle \u0026 routines, social activities, and outings \u0026 culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose Real-Time Accuracy (RTA) to jointly capture correctness and responsiveness under tight decision windows, and Memory Persistence Time (MPT) as a forward-looking metric for long-term retention in continuous streams. In this work, we report RTA results for current models and release TeleEgo, together with an MPT evaluation framework, as a realistic and extensible benchmark for future egocentric assistants with stronger streaming memory, enabling systematic study of both real-time behavior and long-horizon memory.","short_abstract":"Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introdu...","url_abs":"https://arxiv.org/abs/2510.23981","url_pdf":"https://arxiv.org/pdf/2510.23981v4","authors":"[\"Jiaqi Yan\",\"Ruilong Ren\",\"Jingren Liu\",\"Shuning Xu\",\"Ling Wang\",\"Yiheng Wang\",\"Xinlin Zhong\",\"Yun Wang\",\"Long Zhang\",\"Xiangyu Chen\",\"Changzhi Sun\",\"Jixiang Luo\",\"Dell Zhang\",\"Hao Sun\",\"Chi Zhang\",\"Xuelong Li\"]","published":"2025-10-28T01:24:24Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}
