{"ID":2873670,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.07295","arxiv_id":"2509.07295","title":"Reconstruction Alignment Improves Unified Multimodal Models","abstract":"Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense \"text prompts,\" providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\\rightarrow$0.90) and DPGBench (80.93$\\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\\rightarrow$3.75, GEdit 6.94$\\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs","short_abstract":"Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We...","url_abs":"https://arxiv.org/abs/2509.07295","url_pdf":"https://arxiv.org/pdf/2509.07295v3","authors":"[\"Ji Xie\",\"Trevor Darrell\",\"Luke Zettlemoyer\",\"XuDong Wang\"]","published":"2025-09-08T23:59:32Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Diffusion Model\"]","has_code":false}
