{"ID":2861649,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.02295","arxiv_id":"2510.02295","title":"VideoNSA: Native Sparse Attention Scales Video Understanding","abstract":"Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks. Project Page: https://enxinsong.com/VideoNSA-web/, Code: https://github.com/Espere-1119-Song/VideoNSA","short_abstract":"Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-t...","url_abs":"https://arxiv.org/abs/2510.02295","url_pdf":"https://arxiv.org/pdf/2510.02295v2","authors":"[\"Enxin Song\",\"Wenhao Chai\",\"Shusheng Yang\",\"Ethan Armand\",\"Xiaojun Shan\",\"Haiyang Xu\",\"Jianwen Xie\",\"Zhuowen Tu\"]","published":"2025-10-02T17:58:54Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","project_urls":"[\"https://enxinsong.com/VideoNSA-web/\"]","has_code":false,"code_links":[{"ID":608827,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2861649,"paper_url":"https://arxiv.org/abs/2510.02295","paper_title":"VideoNSA: Native Sparse Attention Scales Video Understanding","repo_url":"https://github.com/Espere-1119-Song/VideoNSA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
