{"ID":2840539,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.13243","arxiv_id":"2511.13243","title":"Uncovering and Mitigating Transient Blindness in Multimodal Model Editing","abstract":"Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscuring overfitting. We propose a comprehensive locality evaluation framework, covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, uncovering a phenomenon we term transient blindness: overfitting to edit-similar text while ignoring visuals. Token analysis shows edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.","short_abstract":"Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscuring overfitting. We propose a comprehensive locality evaluation framework, covering three key di...","url_abs":"https://arxiv.org/abs/2511.13243","url_pdf":"https://arxiv.org/pdf/2511.13243v1","authors":"[\"Xiaoqi Han\",\"Ru Li\",\"Ran Yi\",\"Hongye Tan\",\"Zhuomin Liang\",\"Víctor Gutiérrez-Basulto\",\"Jeff Z. Pan\"]","published":"2025-11-17T11:04:33Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CV\"]","methods":"[]","has_code":false}
