{"ID":2832534,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.05665","arxiv_id":"2512.05665","title":"Interleaved Latent Visual Reasoning with Selective Perceptual Modeling","abstract":"Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet faces limitations: methods either fail to capture intermediate state evolution due to single-step, non-interleaved structures, or sacrifice precise perceptual modeling by over-compressing features. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. Specifically, we employ a self-supervision strategy where a momentum teacher model selectively distills relevant features from ground-truth intermediate images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning. The code is available at https://github.com/XD111ds/ILVR.","short_abstract":"Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet faces limitations: methods either fail to capt...","url_abs":"https://arxiv.org/abs/2512.05665","url_pdf":"https://arxiv.org/pdf/2512.05665v3","authors":"[\"Shuai Dong\",\"Siyuan Wang\",\"Xingyu Liu\",\"Chenglin Li\",\"Haowen Hou\",\"Zhongyu Wei\"]","published":"2025-12-05T12:09:39Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606253,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2832534,"paper_url":"https://arxiv.org/abs/2512.05665","paper_title":"Interleaved Latent Visual Reasoning with Selective Perceptual Modeling","repo_url":"https://github.com/XD111ds/ILVR","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}