{"ID":3004739,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T11:43:53.432517148Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03800","arxiv_id":"2606.03800","title":"Trading Human Curation for Synthetic Augmentation in RLVR","abstract":"The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic language models. Each task requires a sandboxed setup, a prompt, and a hand-authored reward function, and only tasks that pass a quality bar produce useful training signal. Hand-curation at this quality bar does not scale economically to the task counts effective RL training requires, and the substitution rate between automatically generated task variants and human-authored ones is not yet established. We investigate using pre-specified, gate-filtered augmentations of a small hand-authored base as a substitute for additional human curation during RLVR. We formalize the cost-adjusted trade rate $ρ_{\\text{cost}}$ between augmented and human-authored tasks, measure it through a controlled ablation across training corpora with varying augmentation share, and characterize the end-to-end economics of the augmentation pipeline. Substituting augmented content for additional human-authored tasks retains aggregate held-out generalization on a ten-benchmark suite spanning code, instruction following, reasoning, and multi-turn agentic function-calling. The cost-adjusted trade rate $ρ_{\\text{cost}}$ between gated synthetic and human-authored RLVR tasks stays in $[1.4\\times, 11.6\\times]$ across the plausible $c_{\\text{human}}/c_{\\text{aug}}$ range.","short_abstract":"The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic language models. Each task requires a sandboxed setup, a prompt, and a hand-authored reward function, and only tasks that pass a quality bar produce useful training signal. Hand-curatio...","url_abs":"https://arxiv.org/abs/2606.03800","url_pdf":"https://arxiv.org/pdf/2606.03800v1","authors":"[\"Akshansh \\u003clast\\u003e\",\"Leonardo Rosa Rodrigues\",\"Michael Korostelev\",\"Youssef Hassan\",\"Mark E. Whiting\"]","published":"2026-06-02T15:48:28Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
