{"ID":2888472,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.00945","arxiv_id":"2508.00945","title":"Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment","abstract":"Vision Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch and layer-wise embedding, and Progressive Attention Integration (PAI) that systematically coordinates LPWCA, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from semantic to regional levels while preventing attention drift and maximizing individual attention benefits. Experimental results on ten diverse vision-language benchmarks demonstrate that our CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while providing enhanced interpretability through more regionally focused and semantically aligned attention patterns.","short_abstract":"Vision Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to...","url_abs":"https://arxiv.org/abs/2508.00945","url_pdf":"https://arxiv.org/pdf/2508.00945v1","authors":"[\"Yifan Wang\",\"Hongfeng Ai\",\"Quangao Liu\",\"Maowei Jiang\",\"Ruiyuan Kang\",\"Ruiqi Li\",\"Jiahua Dong\",\"Mengting Xiao\",\"Cheng Jiang\",\"Chenzhong Li\"]","published":"2025-07-31T17:14:55Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}
