{"ID":2896067,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.07633","arxiv_id":"2507.07633","title":"T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates","abstract":"Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding for Ultra-Low Bitrate (ULB) scenarios by leveraging powerful generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or excessive dependence on high-level text guidance, which tend to inadequately capture fine-grained motion details, leading to unrealistic or incoherent reconstructions. To address these challenges, we propose Trajectory-Guided Generative Video Coding (dubbed T-GVC), a novel framework that bridges low-level motion tracking with high-level semantic understanding. T-GVC features a semantic-aware sparse motion sampling pipeline that extracts pixel-wise motion as sparse trajectory points based on their semantic importance, significantly reducing the bitrate while preserving critical temporal semantic information. In addition, by integrating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free guidance mechanism in latent space to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that T-GVC outperforms both traditional and neural video codecs under ULB conditions. Furthermore, additional experiments confirm that our framework achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.","short_abstract":"Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding for Ultra-Low Bitrate (ULB) scenarios by leveraging powerful generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or excessive dependence on hig...","url_abs":"https://arxiv.org/abs/2507.07633","url_pdf":"https://arxiv.org/pdf/2507.07633v4","authors":"[\"Zhitao Wang\",\"Hengyu Man\",\"Wenrui Li\",\"Xingtao Wang\",\"Xiaopeng Fan\",\"Debin Zhao\"]","published":"2025-07-10T11:01:58Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.MM\"]","methods":"[\"Diffusion Model\"]","has_code":false}