{"ID":2869729,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.13767","arxiv_id":"2509.13767","title":"VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI","abstract":"Accurate segmentation of articulatory structures in real-time MRI (rtMRI) remains challenging, as existing methods rely primarily on visual cues and overlook complementary information from synchronized speech signals. We propose VocSegMRI, a multimodal framework integrating video, audio, and phonological inputs via cross-attention fusion and a contrastive learning objective that improves cross-modal alignment and segmentation precision. Evaluated on USC-75 and further validated via zero-shot transfer on USC-TIMIT, VocSegMRI outperforms unimodal and multimodal baselines, with ablations confirming the contribution of each component.","short_abstract":"Accurate segmentation of articulatory structures in real-time MRI (rtMRI) remains challenging, as existing methods rely primarily on visual cues and overlook complementary information from synchronized speech signals. We propose VocSegMRI, a multimodal framework integrating video, audio, and phonological inputs via cro...","url_abs":"https://arxiv.org/abs/2509.13767","url_pdf":"https://arxiv.org/pdf/2509.13767v4","authors":"[\"Daiqi Liu\",\"Johannes Enk\",\"Maureen Stone\",\"Fangxu Xing\",\"Tomás Arias-Vergara\",\"Jerry L. Prince\",\"Jana Hutter\",\"Jonghye Woo\",\"Andreas Maier\",\"Paula Andrea Pérez-Toro\"]","published":"2025-09-17T07:32:00Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}