{"ID":2888910,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.23018","arxiv_id":"2507.23018","title":"Data Readiness for Scientific AI at Scale","abstract":"This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains - climate, nuclear fusion, bio/health, and materials - to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.","short_abstract":"This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains - climate, nuclear fusion, bio/health, and materials - to identify common preprocessing patterns and domain-sp...","url_abs":"https://arxiv.org/abs/2507.23018","url_pdf":"https://arxiv.org/pdf/2507.23018v1","authors":"[\"Wesley Brewer\",\"Patrick Widener\",\"Valentine Anantharaj\",\"Feiyi Wang\",\"Tom Beck\",\"Arjun Shankar\",\"Sarp Oral\"]","published":"2025-07-30T18:30:37Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CE\",\"cs.DC\",\"cs.LG\"]","methods":"[\"Transformer\"]","has_code":false}