{"ID":2844237,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.06185","arxiv_id":"2511.06185","title":"Dataforge: Agentic Platform for Autonomous Data Engineering","abstract":"The growing demand for artificial intelligence (AI) applications in materials discovery, molecular modeling, and climate science has made data preparation a critical but labor-intensive bottleneck. Raw data from diverse sources must be cleaned, normalized, and transformed to become AI-ready, where effective feature transformation and selection are essential for robust learning. We present Dataforge, an LLM-powered agentic data engineering platform for tabular data that is automatic, safe, and non-expert friendly. It autonomously performs data cleaning and iteratively optimizes feature operations under a budgeted feedback loop with automatic stopping. Across tabular benchmarks, it achieves the best overall downstream performance; ablations further confirm the roles of routing/iterative refinement and grounding in accuracy and reliability. Dataforge demonstrates a practical path toward autonomous data agents that transform raw data from data to better data.","short_abstract":"The growing demand for artificial intelligence (AI) applications in materials discovery, molecular modeling, and climate science has made data preparation a critical but labor-intensive bottleneck. Raw data from diverse sources must be cleaned, normalized, and transformed to become AI-ready, where effective feature tra...","url_abs":"https://arxiv.org/abs/2511.06185","url_pdf":"https://arxiv.org/pdf/2511.06185v2","authors":"[\"Xinyuan Wang\",\"Hongyu Cao\",\"Kunpeng Liu\",\"Yanjie Fu\"]","published":"2025-11-09T01:58:13Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
