{"ID":2866695,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22718","arxiv_id":"2509.22718","title":"PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos","abstract":"Existing singing voice synthesis (SVS) models largely rely on fine-grained, phoneme-level durations, which limits their practical application. These methods overlook the complementary role of visual information in duration prediction. To address these issues, we propose PerformSinger, a pioneering multimodal SVS framework, which incorporates lip cues from video as a visual modality, enabling high-quality \"duration-free\" singing voice synthesis. PerformSinger comprises parallel multi-branch multimodal encoders, a feature fusion module, a duration and variational prediction network, a mel-spectrogram decoder and a vocoder. The fusion module, composed of adapter and fusion blocks, employs a progressive fusion strategy within an aligned semantic space to produce high-quality multimodal feature representations, thereby enabling accurate duration prediction and high-fidelity audio synthesis. To facilitate the research, we design, collect and annotate a novel SVS dataset involving synchronized video streams and precise phoneme-level manual annotations. Extensive experiments demonstrate the state-of-the-art performance of our proposal in both subjective and objective evaluations. The code and dataset will be publicly available.","short_abstract":"Existing singing voice synthesis (SVS) models largely rely on fine-grained, phoneme-level durations, which limits their practical application. These methods overlook the complementary role of visual information in duration prediction. To address these issues, we propose PerformSinger, a pioneering multimodal SVS framewo...","url_abs":"https://arxiv.org/abs/2509.22718","url_pdf":"https://arxiv.org/pdf/2509.22718v1","authors":"[\"Ke Gu\",\"Zhicong Wu\",\"Peng Bai\",\"Sitong Qiao\",\"Zhiqi Jiang\",\"Junchen Lu\",\"Xiaodong Shi\",\"Xinyuan Qian\"]","published":"2025-09-24T16:30:40Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.MM\",\"cs.SD\"]","methods":"[]","has_code":false}
