{"ID":2859195,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.05652","arxiv_id":"2510.05652","title":"SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets","abstract":"In this work, we present a method and two large-scale datasets for Script-Driven Multimodal Video Summarization. The proposed method, SD-MVSum, builds on our earlier SD-VSum method for script-driven video summarization, which considered just the visual content of the video. SD-MVSum takes into account, in addition to the visual modality, the relevance of the user-provided script with the spoken content (i.e., audio transcript) of the video. The dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This mechanism explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for script-driven (S-VideoXum) and generic (MrHiSum) video summarization, to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of the proposed SD-MVSum method against other SotA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.","short_abstract":"In this work, we present a method and two large-scale datasets for Script-Driven Multimodal Video Summarization. The proposed method, SD-MVSum, builds on our earlier SD-VSum method for script-driven video summarization, which considered just the visual content of the video. SD-MVSum takes into account, in addition to t...","url_abs":"https://arxiv.org/abs/2510.05652","url_pdf":"https://arxiv.org/pdf/2510.05652v2","authors":"[\"Manolis Mylonas\",\"Charalampia Zerva\",\"Evlampios Apostolidis\",\"Vasileios Mezaris\"]","published":"2025-10-07T08:03:56Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":608615,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2859195,"paper_url":"https://arxiv.org/abs/2510.05652","paper_title":"SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets","repo_url":"https://github.com/IDT-ITI/SD-MVSum","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}