{"ID":2854248,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.15857","arxiv_id":"2510.15857","title":"BLIP3o-NEXT: Next Frontier of Native Image Generation","abstract":"We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.","short_abstract":"We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing th...","url_abs":"https://arxiv.org/abs/2510.15857","url_pdf":"https://arxiv.org/pdf/2510.15857v1","authors":"[\"Jiuhai Chen\",\"Le Xue\",\"Zhiyang Xu\",\"Xichen Pan\",\"Shusheng Yang\",\"Can Qin\",\"An Yan\",\"Honglu Zhou\",\"Zeyuan Chen\",\"Lifu Huang\",\"Tianyi Zhou\",\"Junnan Li\",\"Silvio Savarese\",\"Caiming Xiong\",\"Ran Xu\"]","published":"2025-10-17T17:50:58Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\"]","has_code":false}