{"ID":2852026,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.18269","arxiv_id":"2510.18269","title":"StreamingTOM: Streaming Token Compression for Efficient Video Understanding","abstract":"Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens, ensuring predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\\times$ kv-cache compression ratio; compared to prior SOTA (LiveVLM), it delivers $1.2\\times$ lower peak memory and $2\\times$ faster TTFT. StreamingTOM achieves state-of-the-art accuracy among training-free methods with an average of $63.8\\%$ on offline benchmarks and $55.8\\%$ accuracy and $3.7$ score on RVS. These results demonstrate that real-time streaming video understanding with bounded active memory is achievable without model retraining.","short_abstract":"Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only...","url_abs":"https://arxiv.org/abs/2510.18269","url_pdf":"https://arxiv.org/pdf/2510.18269v2","authors":"[\"Xueyi Chen\",\"Keda Tao\",\"Kele Shao\",\"Huan Wang\"]","published":"2025-10-21T03:39:41Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
