{"ID":2921617,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T05:56:00.181519634Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01066","arxiv_id":"2606.01066","title":"Before the Model Learns the Bug: Fuzzing RLVR Verifiers","abstract":"Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We study this failure mode with a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy and stricter reference verifiers, logs paired decisions, and reports false-positive, false-negative, disagreement, exploit, and uncertainty metrics.","short_abstract":"Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We st...","url_abs":"https://arxiv.org/abs/2606.01066","url_pdf":"https://arxiv.org/pdf/2606.01066v1","authors":"[\"Jaideep Ray\"]","published":"2026-05-31T07:18:07Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
