{"ID":2848014,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.26412","arxiv_id":"2510.26412","title":"LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation","abstract":"Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 17 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality and background consistency but markedly weaker fine-grained text-video alignment and character consistency. These findings suggest that improving prompt faithfulness and identity preservation remains a key challenge for long-form video generation. Our code and data are released at https://github.com/XqZeppelinhead0702/LoCoT2V-Bench","short_abstract":"Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene pr...","url_abs":"https://arxiv.org/abs/2510.26412","url_pdf":"https://arxiv.org/pdf/2510.26412v3","authors":"[\"Xiangqing Zheng\",\"Chengyue Wu\",\"Kehai Chen\",\"Min Zhang\"]","published":"2025-10-30T12:00:46Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[]","has_code":false,"code_links":[{"ID":607585,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2848014,"paper_url":"https://arxiv.org/abs/2510.26412","paper_title":"LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation","repo_url":"https://github.com/XqZeppelinhead0702/LoCoT2V-Bench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}