{"ID":2832053,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.06883","arxiv_id":"2512.06883","title":"Structural and Disentangled Adaptation of Large Vision Language Models for Multimodal Recommendation","abstract":"Multimodal recommendation enhances accuracy by leveraging visual and textual signals, and its success largely depends on learning high-quality cross-modal representations. Recent advances in Large Vision-Language Models (LVLMs) offer unified multimodal representation learning, making them a promising backbone. However, applying LVLMs to recommendation remains challenging due to (i) representation misalignment, where domain gaps between item data and general pre-training lead to unaligned embedding spaces, and (ii) gradient conflicts during fine-tuning, where shared adapters cause interference and a lack of discriminative power. To address this, we propose SDA, a lightweight framework for Structural and Disentangled Adaptation, which integrates two components: Cross-Modal Structural Alignment (CMSA) and Modality-Disentangled Adaptation (MoDA). CMSA aligns embeddings using intra-modal structures as a soft teacher, while MoDA mitigates gradient conflicts via expertized, gated low-rank paths to disentangle gradient flows. Experiments on three public Amazon datasets show SDA integrates seamlessly with existing multimodal and sequential recommenders, yielding average gains of 6.15% in Hit@10 and 8.64% in NDCG@10. It also achieves up to 12.83% and 18.70% gains on long-tail items with minimal inference overhead. Our code and full experimental results are available at https://github.com/RaoZhongtao/SDA.","short_abstract":"Multimodal recommendation enhances accuracy by leveraging visual and textual signals, and its success largely depends on learning high-quality cross-modal representations. Recent advances in Large Vision-Language Models (LVLMs) offer unified multimodal representation learning, making them a promising backbone. However,...","url_abs":"https://arxiv.org/abs/2512.06883","url_pdf":"https://arxiv.org/pdf/2512.06883v2","authors":"[\"Zhongtao Rao\",\"Peilin Zhou\",\"Dading Chong\",\"Zhiwei Chen\",\"Shoujin Wang\",\"Nan Tang\"]","published":"2025-12-07T15:18:04Z","proceeding":"cs.IR","tasks":"[\"cs.IR\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":606195,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2832053,"paper_url":"https://arxiv.org/abs/2512.06883","paper_title":"Structural and Disentangled Adaptation of Large Vision Language Models for Multimodal Recommendation","repo_url":"https://github.com/RaoZhongtao/SDA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
