{"ID":3005021,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T06:46:15.197025399Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03239","arxiv_id":"2606.03239","title":"ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents","abstract":"LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.","short_abstract":"LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision...","url_abs":"https://arxiv.org/abs/2606.03239","url_pdf":"https://arxiv.org/pdf/2606.03239v1","authors":"[\"Zheng Liu\",\"Longxiang Zhang\",\"Xintong Wang\",\"Zhiang Xu\",\"Shaoxiong Zhan\",\"Xin Shan\",\"Wen Huang\",\"Tao Dai\",\"Shu-Tao Xia\",\"Chengfu Huo\",\"Liang Ding\"]","published":"2026-06-02T06:58:54Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
