{"ID":2846266,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.02652","arxiv_id":"2511.02652","title":"Differentiable Hierarchical Visual Tokenization","abstract":"Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.","short_abstract":"Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained...","url_abs":"https://arxiv.org/abs/2511.02652","url_pdf":"https://arxiv.org/pdf/2511.02652v1","authors":"[\"Marius Aasan\",\"Martine Hjelkrem-Tan\",\"Nico Catalano\",\"Changkyu Choi\",\"Adín Ramírez Rivera\"]","published":"2025-11-04T15:18:29Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false}