{"ID":2843277,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.08031","arxiv_id":"2511.08031","title":"Multi-modal Deepfake Detection and Localization with FPN-Transformer","abstract":"The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), addressing critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies. The dual-branch prediction head simultaneously predicts forgery probabilities and refines temporal offsets of manipulated segments, achieving frame-level localization precision. We evaluate our approach on the test set of the IJCAI'25 DDL-AV benchmark, where it achieves a final score of 0.7535 for cross-modal deepfake detection and localization in challenging environments. Experimental results confirm the effectiveness of our approach and offer a novel approach to generalized deepfake detection.
Our code is available at https://github.com/Zig-HS/MM-DDL","short_abstract":"The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inabilit...","url_abs":"https://arxiv.org/abs/2511.08031","url_pdf":"https://arxiv.org/pdf/2511.08031v1","authors":"[\"Chende Zheng\",\"Ruiqi Suo\",\"Zhoulin Ji\",\"Jingyi Deng\",\"Fangbin Yi\",\"Chenhao Lin\",\"Chao Shen\"]","published":"2025-11-11T09:33:39Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Transformer\",\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":607194,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2843277,"paper_url":"https://arxiv.org/abs/2511.08031","paper_title":"Multi-modal Deepfake Detection and Localization with FPN-Transformer","repo_url":"https://github.com/Zig-HS/MM-DDL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}