{"ID":3053264,"CreatedAt":"2026-06-04T04:41:36.695875263Z","UpdatedAt":"2026-06-05T22:02:57.54306338Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04261","arxiv_id":"2606.04261","title":"Can Generalist Agents Automate Data Curation?","abstract":"Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.","short_abstract":"Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curati...","url_abs":"https://arxiv.org/abs/2606.04261","url_pdf":"https://arxiv.org/pdf/2606.04261v1","authors":"[\"Feiyang Kang\",\"Hanze Li\",\"Adam Nguyen\",\"Mahavir Dabas\",\"Jiaqi W. Ma\",\"Frederic Sala\",\"Dawn Song\",\"Ruoxi Jia\"]","published":"2026-06-02T22:26:53Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.CV\",\"cs.ET\",\"cs.LG\"]","methods":"[\"LoRA\"]","has_code":false}