{"ID":2831953,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.06699","arxiv_id":"2512.06699","title":"Predictive Modeling of I/O Performance for Machine Learning Training Pipelines: A Data-Driven Approach to Storage Optimization","abstract":"Modern machine learning training is increasingly bottlenecked by data I/O rather than compute. GPUs often sit idle at below 50% utilization waiting for data. This paper presents a machine learning approach to predict I/O performance and recommend optimal storage configurations for ML training pipelines. We collected 141 observations through systematic benchmarking across different storage backends (NVMe SSD, network-attached storage, in-memory filesystems), data formats, and access patterns, covering both low-level I/O operations and full training pipelines. After evaluating seven regression models and three classification approaches, XGBoost achieved the best performance with R-squared of 0.991, predicting I/O throughput within 11.8% error on average. Feature importance analysis revealed that throughput metrics and batch size are the primary performance drivers. This data-driven approach can reduce configuration time from days of trial-and-error to minutes of predictive recommendation. The methodology is reproducible and extensible to other resource management problems in ML systems. Code and data are available at https://github.com/knkarthik01/gpu_storage_ml_project","short_abstract":"Modern machine learning training is increasingly bottlenecked by data I/O rather than compute. GPUs often sit idle at below 50% utilization waiting for data. This paper presents a machine learning approach to predict I/O performance and recommend optimal storage configurations for ML training pipelines. We collected 14...","url_abs":"https://arxiv.org/abs/2512.06699","url_pdf":"https://arxiv.org/pdf/2512.06699v2","authors":"[\"Karthik Prabhakar\",\"Durgamadhab Mishra\"]","published":"2025-12-07T07:25:08Z","proceeding":"cs.PF","tasks":"[\"cs.PF\",\"cs.AI\",\"cs.LG\"]","methods":"[]","has_code":false,"code_links":[{"ID":606180,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2831953,"paper_url":"https://arxiv.org/abs/2512.06699","paper_title":"Predictive Modeling of I/O Performance for Machine Learning Training Pipelines: A Data-Driven Approach to Storage Optimization","repo_url":"https://github.com/knkarthik01/gpu_storage_ml_project","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}