{"ID":2921992,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-02T06:31:28.386602592Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.00562","arxiv_id":"2606.00562","title":"DeepLatent: Think with Images via Parallel Latent Visual Reasoning","abstract":"The emerging paradigm of \"thinking with images\" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.","short_abstract":"The emerging paradigm of \"thinking with images\" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types....","url_abs":"https://arxiv.org/abs/2606.00562","url_pdf":"https://arxiv.org/pdf/2606.00562v1","authors":"[\"Dongchen Lu\",\"Zhimo Li\",\"Mao Shu\",\"Huo Cao\"]","published":"2026-05-30T06:33:24Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
