{"ID":2827252,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.17909","arxiv_id":"2512.17909","title":"Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing","abstract":"Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.","short_abstract":"Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative...","url_abs":"https://arxiv.org/abs/2512.17909","url_pdf":"https://arxiv.org/pdf/2512.17909v1","authors":"[\"Shilong Zhang\",\"He Zhang\",\"Zhifei Zhang\",\"Chongjian Ge\",\"Shuchen Xue\",\"Shaoteng Liu\",\"Mengwei Ren\",\"Soo Ye Kim\",\"Yuqian Zhou\",\"Qing Liu\",\"Daniil Pakhomov\",\"Kai Zhang\",\"Zhe Lin\",\"Ping Luo\"]","published":"2025-12-19T18:59:57Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Variational Autoencoder\"]","has_code":false}
