{"ID":2867604,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.17562","arxiv_id":"2509.17562","title":"Visual Instruction Pretraining for Domain-Specific Foundation Models","abstract":"Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.","short_abstract":"Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addr...","url_abs":"https://arxiv.org/abs/2509.17562","url_pdf":"https://arxiv.org/pdf/2509.17562v4","authors":"[\"Yuxuan Li\",\"Yicheng Zhang\",\"Wenhao Tang\",\"Yimian Dai\",\"Ming-Ming Cheng\",\"Xiang Li\",\"Jian Yang\"]","published":"2025-09-22T10:57:42Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Transformer\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609496,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2867604,"paper_url":"https://arxiv.org/abs/2509.17562","paper_title":"Visual Instruction Pretraining for Domain-Specific Foundation Models","repo_url":"https://github.com/zcablii/ViTP","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
