{"ID":2863856,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25162","arxiv_id":"2509.25162","title":"AlignTok: Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models","abstract":"In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy called AlignTok: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256$\\times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, text-to-image models trained with our tokenizer consistently outperforms FLUX VAE and VA-VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.","short_abstract":"In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We i...","url_abs":"https://arxiv.org/abs/2509.25162","url_pdf":"https://arxiv.org/pdf/2509.25162v2","authors":"[\"Bowei Chen\",\"Sai Bi\",\"Hao Tan\",\"He Zhang\",\"Tianyuan Zhang\",\"Zhengqi Li\",\"Yuanjun Xiong\",\"Jianming Zhang\",\"Kai Zhang\"]","published":"2025-09-29T17:57:39Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Variational Autoencoder\"]","has_code":false}
