{"ID":2896778,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.07306","arxiv_id":"2507.07306","title":"ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning","abstract":"LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our code is available here: https://github.com/pigeonai-org/ViDove","short_abstract":"LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. In...","url_abs":"https://arxiv.org/abs/2507.07306","url_pdf":"https://arxiv.org/pdf/2507.07306v1","authors":"[\"Yichen Lu\",\"Wei Dai\",\"Jiaen Liu\",\"Ching Wing Kwok\",\"Zongheng Wu\",\"Xudong Xiao\",\"Ao Sun\",\"Sheng Fu\",\"Jianyuan Zhan\",\"Yian Wang\",\"Takatomo Saito\",\"Sicheng Lai\"]","published":"2025-07-09T22:05:46Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"eess.AS\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":612308,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2896778,"paper_url":"https://arxiv.org/abs/2507.07306","paper_title":"ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning","repo_url":"https://github.com/pigeonai-org/ViDove","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}