{"ID":2843552,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.08521","arxiv_id":"2511.08521","title":"UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist","abstract":"While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\\rightarrow$ multi-round editing $\\rightarrow$ object segmentation $\\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)","short_abstract":"While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that u...","url_abs":"https://arxiv.org/abs/2511.08521","url_pdf":"https://arxiv.org/pdf/2511.08521v1","authors":"[\"Zhengyang Liang\",\"Daoan Zhang\",\"Huichi Zhou\",\"Rui Huang\",\"Bobo Li\",\"Yuechen Zhang\",\"Shengqiong Wu\",\"Xiaohan Wang\",\"Jiebo Luo\",\"Lizi Liao\",\"Hao Fei\"]","published":"2025-11-11T17:58:13Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","project_urls":"[\"https://univa.online/\"]","has_code":false}
