{"ID":2861958,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00647","arxiv_id":"2510.00647","title":"MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation","abstract":"The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs' insensitivity to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: https://github.com/LVUGAI/MCM-DPO","short_abstract":"The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs'...","url_abs":"https://arxiv.org/abs/2510.00647","url_pdf":"https://arxiv.org/pdf/2510.00647v1","authors":"[\"Jinlan Fu\",\"Shenzhen Huangfu\",\"Hao Fei\",\"Yichong Huang\",\"Xiaoyu Shen\",\"Xipeng Qiu\",\"See-Kiong Ng\"]","published":"2025-10-01T08:25:18Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608859,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2861958,"paper_url":"https://arxiv.org/abs/2510.00647","paper_title":"MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation","repo_url":"https://github.com/LVUGAI/MCM-DPO","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
