{"ID":2882797,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09789","arxiv_id":"2508.09789","title":"Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations","abstract":"Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system-agnostic zero-finetuning framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. \"a superhero parody with slapstick fights and orchestral stabs\"), bridging the gap between raw content and user intent. We use MLLM output with a state-of-the-art text encoder and feed it into standard collaborative, content-based, and generative recommenders. On the MicroLens-100K dataset, which emulates user interactions with TikTok-style videos, our framework consistently surpasses conventional video, audio, and metadata features in five representative models. Our findings highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to build more intent-aware video recommenders.","short_abstract":"Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewe...","url_abs":"https://arxiv.org/abs/2508.09789","url_pdf":"https://arxiv.org/pdf/2508.09789v1","authors":"[\"Marco De Nadai\",\"Andreas Damianou\",\"Mounia Lalmas\"]","published":"2025-08-13T13:19:31Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}