{"ID":2875115,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.03498","arxiv_id":"2509.03498","title":"OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation","abstract":"We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.","short_abstract":"We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading...","url_abs":"https://arxiv.org/abs/2509.03498","url_pdf":"https://arxiv.org/pdf/2509.03498v3","authors":"[\"Han Li\",\"Xinyu Peng\",\"Yaoming Wang\",\"Zelin Peng\",\"Xin Chen\",\"Rongxiang Weng\",\"Jingang Wang\",\"Xunliang Cai\",\"Wenrui Dai\",\"Hongkai Xiong\"]","published":"2025-09-03T17:29:50Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Diffusion Model\",\"Transformer\",\"Large Language Model\",\"Language Model\",\"Generative Adversarial Network\"]","has_code":false}
