{"ID":2847435,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.27168","arxiv_id":"2510.27168","title":"ShapleyPipe: Hierarchical Shapley Search for Data Preparation Pipeline Construction","abstract":"Automated data preparation pipeline construction is critical for machine learning success, yet existing methods suffer from two fundamental limitations: they treat pipeline construction as black-box optimization without quantifying individual operator contributions, and they struggle with the combinatorial explosion of the search space ($N^M$ configurations for N operators and pipeline length M). We introduce ShapleyPipe, a principled framework that leverages game-theoretic Shapley values to systematically quantify each operator's marginal contribution while maintaining full interpretability. Our key innovation is a hierarchical decomposition that separates category-level structure search from operator-level refinement, reducing the search complexity from exponential to polynomial. To make Shapley computation tractable, we develop: (1) a Multi-Armed Bandit mechanism for intelligent category evaluation with provable convergence guarantees, and (2) Permutation Shapley values to correctly capture position-dependent operator interactions. Extensive evaluation on 18 diverse datasets demonstrates that ShapleyPipe achieves 98.1\\% of high-budget baseline performance while using 24\\% fewer evaluations, and outperforms the state-of-the-art reinforcement learning method by 3.6\\%. Beyond performance gains, ShapleyPipe provides interpretable operator valuations ($ρ$=0.933 correlation with empirical performance) that enable data-driven pipeline analysis and systematic operator library refinement.","short_abstract":"Automated data preparation pipeline construction is critical for machine learning success, yet existing methods suffer from two fundamental limitations: they treat pipeline construction as black-box optimization without quantifying individual operator contributions, and they struggle with the combinatorial explosion of...","url_abs":"https://arxiv.org/abs/2510.27168","url_pdf":"https://arxiv.org/pdf/2510.27168v1","authors":"[\"Jing Chang\",\"Chang Liu\",\"Jinbin Huang\",\"Shuyuan Zheng\",\"Rui Mao\",\"Jianbin Qin\"]","published":"2025-10-31T04:37:16Z","proceeding":"cs.DB","tasks":"[\"cs.DB\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}