{"ID":2836675,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.20714","arxiv_id":"2511.20714","title":"Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation","abstract":"World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.","short_abstract":"World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a...","url_abs":"https://arxiv.org/abs/2511.20714","url_pdf":"https://arxiv.org/pdf/2511.20714v2","authors":"[\"Inferix Team\",\"Tianyu Feng\",\"Yizeng Han\",\"Jiahao He\",\"Yuanyu He\",\"Xi Lin\",\"Teng Liu\",\"Hanfeng Lu\",\"Jiasheng Tang\",\"Wei Wang\",\"Zhiyuan Wang\",\"Jichao Wu\",\"Mingyang Yang\",\"Yinghao Yu\",\"Zeyu Zhang\",\"Bohan Zhuang\"]","published":"2025-11-25T01:45:04Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"LoRA\"]","has_code":false}
