{"ID":2837736,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.19399","arxiv_id":"2511.19399","title":"DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research","abstract":"Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), where rubrics are constructed and maintained to co-evolve with the policy model during training. This allows the rubrics to incorporate newly explored information from search and contrasting model responses, enabling better fact checking and more discriminative on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first fully open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu substantially outperforms existing open deep research agents (by 15.6% over Tongyi DR on average) and matches or exceeds proprietary deep research agents (by 0.7% over OpenAI DR on average), while being significantly smaller and cheaper per query (1000x cheaper than OpenAI DR per query).","short_abstract":"Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, which does not extend to realistic long-form tasks. \nWe address this with Reinfo...","url_abs":"https://arxiv.org/abs/2511.19399","url_pdf":"https://arxiv.org/pdf/2511.19399v3","authors":"[\"Rulin Shao\",\"Akari Asai\",\"Shannon Zejiang Shen\",\"Hamish Ivison\",\"Varsha Kishore\",\"Jingming Zhuo\",\"Xinran Zhao\",\"Molly Park\",\"Samuel G. Finlayson\",\"David Sontag\",\"Tyler Murray\",\"Sewon Min\",\"Pradeep Dasigi\",\"Luca Soldaini\",\"Faeze Brahman\",\"Wen-tau Yih\",\"Tongshuang Wu\",\"Luke Zettlemoyer\",\"Yoon Kim\",\"Hannaneh Hajishirzi\",\"Pang Wei Koh\"]","published":"2025-11-24T18:35:54Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
