{"ID":2840992,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12594","arxiv_id":"2511.12594","title":"Seg-VAR: Image Segmentation with Visual Autoregressive Modeling","abstract":"While visual autoregressive modeling (VAR) strategies have shed light on image generation with the autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose Seg-VAR, a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing the discriminative learning with the latent learning process. Specifically, our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of segmentation mask) encoder that maps segmentation masks into discrete latent tokens using a location-sensitive color mapping to distinguish instances, and (3) a decoder reconstructing masks from these latents. A multi-stage training strategy is introduced: first learning seglat representations via image-seglat joint training, then refining latent transformations, and finally aligning image-encoder-derived latents with seglat distributions. Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems. Code will be available at https://github.com/rkzheng99/Seg-VAR.","short_abstract":"While visual autoregressive modeling (VAR) strategies have shed light on image generation with the autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose...","url_abs":"https://arxiv.org/abs/2511.12594","url_pdf":"https://arxiv.org/pdf/2511.12594v1","authors":"[\"Rongkun Zheng\",\"Lu Qi\",\"Xi Chen\",\"Yi Wang\",\"Kun Wang\",\"Hengshuang Zhao\"]","published":"2025-11-16T13:36:19Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":607017,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2840992,"paper_url":"https://arxiv.org/abs/2511.12594","paper_title":"Seg-VAR: Image Segmentation with Visual Autoregressive Modeling","repo_url":"https://github.com/rkzheng99/Seg-VAR","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}