{"ID":2855442,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.14145","arxiv_id":"2510.14145","title":"High-Dimensional BWDM: A Robust Nonparametric Clustering Validation Index for Large-Scale Data","abstract":"Determining the appropriate number of clusters in unsupervised learning is a central problem in statistics and data science. Traditional validity indices such as Calinski-Harabasz, Silhouette, and Davies-Bouldin-depend on centroid-based distances and therefore degrade in high-dimensional or contaminated data. This paper proposes a new robust, nonparametric clustering validation framework, the High-Dimensional Between-Within Distance Median (HD-BWDM), which extends the recently introduced BWDM criterion to high-dimensional spaces. HD-BWDM integrates random projection and principal component analysis to mitigate the curse of dimensionality and applies trimmed clustering and medoid-based distances to ensure robustness against outliers. We derive theoretical results showing consistency and convergence under Johnson-Lindenstrauss embeddings. Extensive simulations demonstrate that HD-BWDM remains stable and interpretable under high-dimensional projections and contamination, providing a robust alternative to traditional centroid-based validation criteria. The proposed method provides a theoretically grounded, computationally efficient stopping rule for nonparametric clustering in modern high-dimensional applications.","short_abstract":"Determining the appropriate number of clusters in unsupervised learning is a central problem in statistics and data science. Traditional validity indices such as Calinski-Harabasz, Silhouette, and Davies-Bouldin-depend on centroid-based distances and therefore degrade in high-dimensional or contaminated data. This pape...","url_abs":"https://arxiv.org/abs/2510.14145","url_pdf":"https://arxiv.org/pdf/2510.14145v1","authors":"[\"Mohammed Baragilly\",\"Hend Gabr\"]","published":"2025-10-15T22:25:25Z","proceeding":"stat.ML","tasks":"[\"stat.ML\",\"cs.LG\"]","methods":"[]","has_code":false}
