{"ID":2883015,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.08550","arxiv_id":"2508.08550","title":"Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization","abstract":"Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.","short_abstract":"Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio...","url_abs":"https://arxiv.org/abs/2508.08550","url_pdf":"https://arxiv.org/pdf/2508.08550v1","authors":"[\"Chaoqun Cui\",\"Liangbin Huang\",\"Shijing Wang\",\"Zhe Tong\",\"Zhaolong Huang\",\"Xiao Zeng\",\"Xiaofeng Liu\"]","published":"2025-08-12T01:38:31Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
