{"ID":2844965,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.05299","arxiv_id":"2511.05299","title":"LiveStar: Live Streaming Assistant for Real-World Online Video Understanding","abstract":"Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with streaming key-value cache to achieve 1.53x faster inference. We also construct an OmniStar dataset, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar's state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.","short_abstract":"Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. T...","url_abs":"https://arxiv.org/abs/2511.05299","url_pdf":"https://arxiv.org/pdf/2511.05299v1","authors":"[\"Zhenyu Yang\",\"Kairui Zhang\",\"Yuhang Hu\",\"Bing Wang\",\"Shengsheng Qian\",\"Bin Wen\",\"Fan Yang\",\"Tingting Gao\",\"Weiming Dong\",\"Changsheng Xu\"]","published":"2025-11-07T15:00:37Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607334,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2844965,"paper_url":"https://arxiv.org/abs/2511.05299","paper_title":"LiveStar: Live Streaming Assistant for Real-World Online Video Understanding","repo_url":"https://github.com/yzy-bupt/LiveStar","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}