{"ID":2877066,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.20546","arxiv_id":"2508.20546","title":"MM-HSD: Multi-Modal Hate Speech Detection in Videos","abstract":"While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and thus provide essential cues, both individually and in combination with others. In this paper, we present MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions between different modalities in the CMA block. Our approach leads to improved performance when on-screen text is used as a query and the rest of the modalities serve as a key. Experiments on the HateMM dataset show that MM-HSD outperforms state-of-the-art methods on M-F1 score (0.874), using concatenation of transcript, audio, video, on-screen text, and CMA for feature extraction on raw embeddings of the modalities. The code is available at https://github.com/idiap/mm-hsd","short_abstract":"While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modali...","url_abs":"https://arxiv.org/abs/2508.20546","url_pdf":"https://arxiv.org/pdf/2508.20546v1","authors":"[\"Berta Céspedes-Sarrias\",\"Carlos Collado-Capell\",\"Pablo Rodenas-Ruiz\",\"Olena Hrynenko\",\"Andrea Cavallaro\"]","published":"2025-08-28T08:36:35Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.AI\"]","methods":"[]","has_code":false,"code_links":[{"ID":610335,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2877066,"paper_url":"https://arxiv.org/abs/2508.20546","paper_title":"MM-HSD: Multi-Modal Hate Speech Detection in Videos","repo_url":"https://github.com/idiap/mm-hsd","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}