{"ID":2891539,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.17958","arxiv_id":"2507.17958","title":"VIBE: Video-Input Brain Encoder for fMRI Response Modeling","abstract":"We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.","short_abstract":"We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddi...","url_abs":"https://arxiv.org/abs/2507.17958","url_pdf":"https://arxiv.org/pdf/2507.17958v2","authors":"[\"Daniel Carlström Schad\",\"Shrey Dixit\",\"Janis Keck\",\"Viktor Studenyak\",\"Aleksandr Shpilevoi\",\"Andrej Bicanski\"]","published":"2025-07-23T22:02:56Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false}