{"ID":2880433,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.15105","arxiv_id":"2508.15105","title":"Declarative Data Pipeline for Large Scale ML Services","abstract":"Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high communication overhead and coordination complexity. We present a \"Declarative Data Pipeline\" (DDP) architecture that addresses these challenges while processing billions of records efficiently. Our modular framework seamlessly integrates machine learning within Apache Spark using logical computation units called Pipes, departing from traditional microservice approaches. By establishing clear component boundaries and standardized interfaces, we achieve modularity and optimization without sacrificing maintainability. Enterprise case studies demonstrate substantial improvements: 50% better development efficiency, collaboration efforts compressed from weeks to days, 500x scalability improvement, and 10x throughput gains.","short_abstract":"Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high communication overhead and coordination complexity. \nWe present a \"Declarative Dat...","url_abs":"https://arxiv.org/abs/2508.15105","url_pdf":"https://arxiv.org/pdf/2508.15105v4","authors":"[\"Yunzhao Yang\",\"Runhui Wang\",\"Xuanqing Liu\",\"Adit Krishnan\",\"Yefan Tao\",\"Yuqian Deng\",\"Kuangyou Yao\",\"Peiyuan Sun\",\"Henrik Johnson\",\"Aditi sinha\",\"Davor Golac\",\"Gerald Friedland\",\"Usman Shakeel\",\"Daryl Cooke\",\"Joe Sullivan\",\"Madhusudhanan Chandrasekaran\",\"Chris Kong\"]","published":"2025-08-20T22:44:58Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[]","has_code":false}
