{"ID":2890353,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19002","arxiv_id":"2507.19002","title":"Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment","abstract":"Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality while maintaining text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10\\% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides theoretical and empirical support for evolving image generation technology toward higher-order human aesthetic preferences. Code is available at https://github.com/BarretBa/ICTHP.","short_abstract":"Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inher...","url_abs":"https://arxiv.org/abs/2507.19002","url_pdf":"https://arxiv.org/pdf/2507.19002v1","authors":"[\"Ying Ba\",\"Tianyu Zhang\",\"Yalong Bai\",\"Wenyi Mo\",\"Tao Liang\",\"Bing Su\",\"Ji-Rong Wen\"]","published":"2025-07-25T07:01:50Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":611769,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2890353,"paper_url":"https://arxiv.org/abs/2507.19002","paper_title":"Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment","repo_url":"https://github.com/BarretBa/ICTHP","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}