{"ID":2828707,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.14944","arxiv_id":"2512.14944","title":"PuzzleCraft: Exploration-Aware Curriculum Learning for Puzzle-Based RLVR in VLMs","abstract":"RL post-training with verifiable rewards (RLVR) has become a practical route to eliciting chain-of-thought reasoning in vision--language models (VLMs), but scaling it in the visual domain remains challenging due to costly or noisy supervision and reliance on external verifiers. Puzzle-based RLVR is a promising alternative, yet existing approaches often treat puzzle rewards as flat or sparse, which weakens group-relative learning signal. Existing curriculum strategies are overly restrictive: they rely mainly on reward statistics and do not account for exploration in the solution space, which can lead to collapsed rollout dynamics. Further, RL post-training can induce reasoning--answer inconsistency as training progresses. To address these shortcomings, we present PuzzleCraft, a supervision-free framework that scales vision-centric RLVR using a set of lightweight puzzle environments with built-in verification. PuzzleCraft instantiates three puzzles inspired by classic visual pretext tasks: PatchFit, Rotation, and Jigsaw. We introduce a curriculum that combines difficulty with an exploration signal derived from solution-space dispersion, and use it to downweight collapsed prompt groups. In addition, we introduce a new post-training metric, Reasoning-Answer Consistency (RAC), to measure the degree to which the chain-of-thought supports the answer, and show that our exploration-aware curriculum improves RAC and downstream performance. Across a broad suite of vision-centric benchmarks, PuzzleCraft improves robustness and reasoning consistency, yielding consistent downstream gains on both Qwen2.5-VL and Qwen3-VL backbones. 
Overall, our results suggest that scalable puzzle-based RLVR benefits from curricula that account for both difficulty and solution-space collapse, together with explicit consistency-enhancing schemes.","short_abstract":"RL post-training with verifiable rewards (RLVR) has become a practical route to eliciting chain-of-thought reasoning in vision--language models (VLMs), but scaling it in the visual domain remains challenging due to costly or noisy supervision and reliance on external verifiers. Puzzle-based RLVR is a promising alternat...","url_abs":"https://arxiv.org/abs/2512.14944","url_pdf":"https://arxiv.org/pdf/2512.14944v2","authors":"[\"Ahmadreza Jeddi\",\"Hakki Can Karaimer\",\"Hue Nguyen\",\"Zhongling Wang\",\"Ke Zhao\",\"Javad Rajabi\",\"Ran Zhang\",\"Raghav Goyal\",\"Konstantinos G. Derpanis\",\"Babak Taati\",\"Radek Grzeszczuk\"]","published":"2025-12-16T22:17:25Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\",\"LoRA\"]","has_code":false}