{"ID":2873720,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.05949","arxiv_id":"2509.05949","title":"AttriPrompt: Dynamic Prompt Composition Learning for CLIP","abstract":"The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: Over-reliance on constrastive learning objectives that prioritize high-level semantic alignment, neglecting fine-grained feature optimization; Static prompts across all input categories, preventing content-aware adaptation. To address these limitations, we propose AttriPrompt-a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP's vision encoder. We designed an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features retrieve semantically similar prompts from a prompt pool, which are then concatenated to the input of every layer in the text encoder. Leveraging hierarchical visual information embedded in prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37\\% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.","short_abstract":"The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: Over-reliance on constrastive learning objectives that prioritize high-level semantic alignment, neglecting f...","url_abs":"https://arxiv.org/abs/2509.05949","url_pdf":"https://arxiv.org/pdf/2509.05949v1","authors":"[\"Qiqi Zhan\",\"Shiwei Li\",\"Qingjie Liu\",\"Yunhong Wang\"]","published":"2025-09-07T07:07:59Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"LoRA\"]","has_code":false}
