{"ID":3053383,"CreatedAt":"2026-06-04T04:41:36.695875263Z","UpdatedAt":"2026-06-06T04:22:55.119378026Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04433","arxiv_id":"2606.04433","title":"Stateful Visual Encoders for Vision-Language Models","abstract":"Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/","short_abstract":"Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independentl...","url_abs":"https://arxiv.org/abs/2606.04433","url_pdf":"https://arxiv.org/pdf/2606.04433v1","authors":"[\"Zirui Wang\",\"Junwei Yu\",\"Adam Yala\",\"David M. Chan\",\"Joseph E. Gonzalez\",\"Trevor Darrell\"]","published":"2026-06-03T04:31:15Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
