{"ID":2888037,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.00728","arxiv_id":"2508.00728","title":"YOLO-Count: Differentiable Object Counting for Text-to-Image Generation","abstract":"We propose YOLO-Count, a differentiable open-vocabulary object counting model that tackles both general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the 'cardinality' map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.","short_abstract":"We propose YOLO-Count, a differentiable open-vocabulary object counting model that tackles both general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the 'cardinality' map, a novel regression target that accounts for variations in object size and spa...","url_abs":"https://arxiv.org/abs/2508.00728","url_pdf":"https://arxiv.org/pdf/2508.00728v1","authors":"[\"Guanning Zeng\",\"Xiang Zhang\",\"Zirui Wang\",\"Haiyang Xu\",\"Zeyuan Chen\",\"Bingnan Li\",\"Zhuowen Tu\"]","published":"2025-08-01T15:51:39Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}