{"ID":2863804,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25084","arxiv_id":"2509.25084","title":"Scaling Generalist Data-Analytic Agents","abstract":"Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to face diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K and DataMind-7B,14B for the community's future research.","short_abstract":"Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to face diverse-format, large-scale data files and long-horizon, multi-st...","url_abs":"https://arxiv.org/abs/2509.25084","url_pdf":"https://arxiv.org/pdf/2509.25084v3","authors":"[\"Shuofei Qiao\",\"Yanqiu Zhao\",\"Zhisong Qiu\",\"Xiaobin Wang\",\"Jintian Zhang\",\"Zhao Bin\",\"Ningyu Zhang\",\"Yong Jiang\",\"Pengjun Xie\",\"Fei Huang\",\"Huajun Chen\"]","published":"2025-09-29T17:23:08Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.IR\",\"cs.LG\"]","methods":"[\"LoRA\"]","has_code":false}
