{"ID":2868965,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.16197","arxiv_id":"2509.16197","title":"MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer","abstract":"Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.","short_abstract":"Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. \nWe present Manzano, a simple and scalable unified framework that substantially reduces t...","url_abs":"https://arxiv.org/abs/2509.16197","url_pdf":"https://arxiv.org/pdf/2509.16197v1","authors":"[\"Yanghao Li\",\"Rui Qian\",\"Bowen Pan\",\"Haotian Zhang\",\"Haoshuo Huang\",\"Bowen Zhang\",\"Jialing Tong\",\"Haoxuan You\",\"Xianzhi Du\",\"Zhe Gan\",\"Hyunjik Kim\",\"Chao Jia\",\"Zhenbang Wang\",\"Yinfei Yang\",\"Mingfei Gao\",\"Zi-Yi Dou\",\"Wenze Hu\",\"Chang Gao\",\"Dongxu Li\",\"Philipp Dufter\",\"Zirui Wang\",\"Guoli Yin\",\"Zhengdong Zhang\",\"Chen Chen\",\"Yang Zhao\",\"Ruoming Pang\",\"Zhifeng Chen\"]","published":"2025-09-19T17:58:00Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false}
