{"ID":2828067,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.22170","arxiv_id":"2512.22170","title":"SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models","abstract":"Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RM is susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM's score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark are available at https://github.com/lian700/SoliReward.","short_abstract":"Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the ar...","url_abs":"https://arxiv.org/abs/2512.22170","url_pdf":"https://arxiv.org/pdf/2512.22170v3","authors":"[\"Jiesong Lian\",\"Ruizhe Zhong\",\"Zixiang Zhou\",\"Xiaoyue Mi\",\"Long Hu\",\"Yuan Zhou\",\"Qinglin Lu\",\"Yixue Hao\",\"Junchi Yan\"]","published":"2025-12-17T14:28:23Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":605846,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2828067,"paper_url":"https://arxiv.org/abs/2512.22170","paper_title":"SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models","repo_url":"https://github.com/lian700/SoliReward","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
