{"ID":2890420,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19110","arxiv_id":"2507.19110","title":"LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models","abstract":"Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose LISA, a Layer-wise Integration and Suppression Approach. LISA leverages the layer-wise functional roles in MLLMs: shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, layer-wise spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully plug-and-play and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6% in $\\text{CHAIR}_\\text{I}$ and improves POPE F1 by up to 5.1%, demonstrating strong generalization across models and tasks. Our code is available at https://github.com/zhlisa1010-eng/LISA.","short_abstract":"Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose LISA, a Layer-wise Integration and Suppression Approach. LISA leverages the layer-wise fun...","url_abs":"https://arxiv.org/abs/2507.19110","url_pdf":"https://arxiv.org/pdf/2507.19110v2","authors":"[\"Zhihui Guo\",\"Xin Man\",\"Hui Xu\",\"Jie Shao\",\"Zhiguo Jiang\",\"Xianchao Zhang\",\"Heng Tao Shen\"]","published":"2025-07-25T09:48:23Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611776,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2890420,"paper_url":"https://arxiv.org/abs/2507.19110","paper_title":"LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models","repo_url":"https://github.com/zhlisa1010-eng/LISA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}