{"ID":2834019,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.02759","arxiv_id":"2512.02759","title":"Towards Language-Independent Face-Voice Association with Multimodal Foundation Models","abstract":"This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures, consisting of a baseline dual-encoder system trained from scratch using contrastive and orthogonal projection losses, and a foundation model approach leveraging ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieved an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.","short_abstract":"This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures, consisting of a baseline dual-encoder system trained from s...","url_abs":"https://arxiv.org/abs/2512.02759","url_pdf":"https://arxiv.org/pdf/2512.02759v1","authors":"[\"Aref Farhadipour\",\"Teodora Vukovic\",\"Volker Dellwo\"]","published":"2025-12-02T13:39:53Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.SD\",\"eess.IV\"]","methods":"[\"LoRA\"]","has_code":false}