{"ID":2829140,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13687","arxiv_id":"2512.13687","title":"Towards Scalable Pre-training of Visual Tokenizers for Generation","abstract":"The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundation flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the ``pre-training scaling problem`` and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pretraining of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves 65.8\\% FID improvement in downstream generation, while conventional autoencoder stagnates very early at 1/10 FLOPS.\nOur pre-trained models are available at https://github.com/MiniMax-AI/VTP.","short_abstract":"The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundation flaw: better pixel-level accuracy does not lead to higher-...","url_abs":"https://arxiv.org/abs/2512.13687","url_pdf":"https://arxiv.org/pdf/2512.13687v2","authors":"[\"Jingfeng Yao\",\"Yuda Song\",\"Yucong Zhou\",\"Xinggang Wang\"]","published":"2025-12-15T18:59:54Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Variational Autoencoder\"]","has_code":false,"code_links":[{"ID":605934,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2829140,"paper_url":"https://arxiv.org/abs/2512.13687","paper_title":"Towards Scalable Pre-training of Visual Tokenizers for Generation","repo_url":"https://github.com/MiniMax-AI/VTP","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
