{"ID":2884646,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.06146","arxiv_id":"2508.06146","title":"Text-guided Visual Prompt DINO for Generic Segmentation","abstract":"Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid prompts open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data\u0026Code are available at https://github.com/WeChatCV/WeVisionOne.","short_abstract":"Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid prompts open-world segmentation, alongside constraints from caption-derived vocabularies. \nTo address these challenges, we propose Prompt-DINO, a text-guided visual Prompt D...","url_abs":"https://arxiv.org/abs/2508.06146","url_pdf":"https://arxiv.org/pdf/2508.06146v1","authors":"[\"Yuchen Guan\",\"Chong Sun\",\"Canmiao Fu\",\"Zhipeng Huang\",\"Chun Yuan\",\"Chen Li\"]","published":"2025-08-08T09:09:30Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":611095,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2884646,"paper_url":"https://arxiv.org/abs/2508.06146","paper_title":"Text-guided Visual Prompt DINO for Generic Segmentation","repo_url":"https://github.com/WeChatCV/WeVisionOne","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}