{"ID":3084879,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T05:00:38.846751169Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05748","arxiv_id":"2606.05748","title":"UNIVID: Unified Vision-Language Model for Video Moderation","abstract":"Global-scale video moderation faces a dual challenge: the need for fine-grained multi-modal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this paper, we present UNIVID, a UNIfied VIsion-language model for video moDeration. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and multi-task reusability. While existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines. By integrating UNIVID as the core captioner, we design a novel end-to-end video moderation system that reduces violation leakage by 42.7% and overkill rate by 37.0% relatively. Meanwhile, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we recycled extensive computation resources while reducing engineering maintenance overhead. To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business.","short_abstract":"Global-scale video moderation faces a dual challenge: the need for fine-grained multi-modal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this pa...","url_abs":"https://arxiv.org/abs/2606.05748","url_pdf":"https://arxiv.org/pdf/2606.05748v1","authors":"[\"Kejuan Yang\",\"Yizhuo Zhang\",\"Mingyuan Du\",\"Yue Zhang\",\"Dixin Zheng\",\"Kaili Zhao\",\"Yang Xiao\",\"Hanzhong Liang\",\"Kenan Xiao\"]","published":"2026-06-04T06:20:23Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}
