{"ID":2864223,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.23736","arxiv_id":"2509.23736","title":"HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation","abstract":"In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2\\% improvement in rFID ($1.47 \\rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\\times$ faster convergence rate and an 18.9\\% boost in gFID ($16.4 \\rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential by a sota rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce multi-scale ViT-based tokenizer in image reconstruction and image generation. We hope our findings and designs advance the ViT-based tokenizers in visual generation tasks.","short_abstract":"In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a...","url_abs":"https://arxiv.org/abs/2509.23736","url_pdf":"https://arxiv.org/pdf/2509.23736v1","authors":"[\"Cong Chen\",\"Ziyuan Huang\",\"Cheng Zou\",\"Muzhi Zhu\",\"Kaixiang Ji\",\"Jiajia Liu\",\"Jingdong Chen\",\"Hao Chen\",\"Chunhua Shen\"]","published":"2025-09-28T08:30:26Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false}
