{"ID":2835399,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.22973","arxiv_id":"2511.22973","title":"BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation","abstract":"Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it yet faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over the state of the art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.","short_abstract":"Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. \nThe emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video g...","url_abs":"https://arxiv.org/abs/2511.22973","url_pdf":"https://arxiv.org/pdf/2511.22973v1","authors":"[\"Zeyu Zhang\",\"Shuning Chang\",\"Yuanyu He\",\"Yizeng Han\",\"Jiasheng Tang\",\"Fan Wang\",\"Bohan Zhuang\"]","published":"2025-11-28T08:25:59Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\"]","project_urls":"[\"https://ziplab.co/BlockVid\"]","has_code":false,"code_links":[{"ID":606509,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2835399,"paper_url":"https://arxiv.org/abs/2511.22973","paper_title":"BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation","repo_url":"https://github.com/alibaba-damo-academy/Inferix","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
