{"ID":2826392,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.19687","arxiv_id":"2512.19687","title":"Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning","abstract":"We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.","short_abstract":"We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-...","url_abs":"https://arxiv.org/abs/2512.19687","url_pdf":"https://arxiv.org/pdf/2512.19687v1","authors":"[\"Apoorv Vyas\",\"Heng-Jui Chang\",\"Cheng-Fu Yang\",\"Po-Yao Huang\",\"Luya Gao\",\"Julius Richter\",\"Sanyuan Chen\",\"Matt Le\",\"Piotr Dollár\",\"Christoph Feichtenhofer\",\"Ann Lee\",\"Wei-Ning Hsu\"]","published":"2025-12-22T18:59:07Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.CV\",\"cs.LG\"]","methods":"[]","has_code":false}
