{"ID":2837095,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.20565","arxiv_id":"2511.20565","title":"DINO-Tok: Adapting DINO for Visual Tokenizers","abstract":"Recent advances in visual generation have emphasized the importance of Latent Generative Models (LGMs), which critically depend on effective visual tokenizers to bridge pixels and semantic representations. However, tokenizers constructed on pre-trained vision foundation models (VFMs) often struggle to balance semantic richness and reconstruction fidelity in high-dimensional latent spaces. In this paper, we introduce DINO-Tok, a visual tokenizer built upon a frozen DINO encoder that supports both continuous autoencoding (DINO-Tok-AE) and discrete vector-quantization (DINO-Tok-VQ). By unifying hierarchical representations from both shallow fine-grained features and deep global semantics into an information-complete latent space, DINO-Tok preserves texture details while maintaining semantic consistency for generation. We further investigate VQ in frozen semantic feature spaces of high dimensionality, where information dilution and codebook collapse frequently arise. To address this issue, we propose Dominant-Subspace Quantization (DSQ), which leverages a global PCA analysis to select principal components while suppressing noisy dimensions, thereby stabilizing codebook optimization and improving reconstruction and generation quality. On ImageNet 256x256, DINO-Tok achieves strong reconstruction performance, with 0.28 rFID for continuous autoencoding and 1.10 rFID for discrete VQ, as well as strong few-step generation performance, with 1.82 gFID for diffusion and 2.44 gFID for autoregressive generation. These results demonstrate that pre-trained VFMs such as DINO can be directly adapted into high-fidelity, semantically aligned visual tokenizers for next-generation latent generative models. 
Code will be publicly available at https://github.com/MKJia/DINO-Tok.","short_abstract":"Recent advances in visual generation have emphasized the importance of Latent Generative Models (LGMs), which critically depend on effective visual tokenizers to bridge pixels and semantic representations. However, tokenizers constructed on pre-trained vision foundation models (VFMs) often struggle to balance semantic...","url_abs":"https://arxiv.org/abs/2511.20565","url_pdf":"https://arxiv.org/pdf/2511.20565v2","authors":"[\"Mingkai Jia\",\"Mingxiao Li\",\"Zhijian Shu\",\"Anlin Zheng\",\"Liaoyuan Fan\",\"Jiaxin Guo\",\"Tianxing Shi\",\"Dongyue Lu\",\"Zeming Li\",\"Xiaoyang Guo\",\"Xiaojuan Qi\",\"Xiao-Xiao Long\",\"Qian Zhang\",\"Ping Tan\",\"Wei Yin\"]","published":"2025-11-25T18:00:00Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\"]","has_code":false,"code_links":[{"ID":606654,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2837095,"paper_url":"https://arxiv.org/abs/2511.20565","paper_title":"DINO-Tok: Adapting DINO for Visual Tokenizers","repo_url":"https://github.com/MKJia/DINO-Tok","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}