{"ID":2895387,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.09184","arxiv_id":"2507.09184","title":"MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models","abstract":"Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance instruction's perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed in https://github.com/ErikZ719/MCA-LLaVA.","short_abstract":"Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimod...","url_abs":"https://arxiv.org/abs/2507.09184","url_pdf":"https://arxiv.org/pdf/2507.09184v2","authors":"[\"Qiyan Zhao\",\"Xiaofeng Zhang\",\"Yiheng Li\",\"Yun Xing\",\"Xiaosong Yuan\",\"Feilong Tang\",\"Sinan Fan\",\"Xuhang Chen\",\"Xuyao Zhang\",\"Dahan Wang\"]","published":"2025-07-12T08:09:35Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":612185,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2895387,"paper_url":"https://arxiv.org/abs/2507.09184","paper_title":"MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models","repo_url":"https://github.com/ErikZ719/MCA-LLaVA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
