{"ID":2841192,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12025","arxiv_id":"2511.12025","title":"A Quick and Exact Method for Distributed Quantile Computation","abstract":"Quantile computation is a core primitive in large-scale data analytics. In Spark, practitioners typically rely on the Greenwald-Khanna (GK) Sketch, an approximate method. When exact quantiles are required, the default option is an expensive global sort. We present GK Select, an exact Spark algorithm that avoids full-data shuffles and completes in a constant number of actions. GK Select leverages GK Sketch to identify a near-target pivot, extracts all values within the error bound around this pivot in each partition in linear time, and then tree-reduces the resulting candidate sets. We show analytically that GK Select matches the executor-side time complexity of GK Sketch while returning the exact quantile. Empirically, GK Select achieves sketch-level latency and outperforms Spark's full sort by approximately 10.5x on 10^9 values across 120 partitions on a 30-core AWS EMR cluster.","short_abstract":"Quantile computation is a core primitive in large-scale data analytics. In Spark, practitioners typically rely on the Greenwald-Khanna (GK) Sketch, an approximate method. When exact quantiles are required, the default option is an expensive global sort. We present GK Select, an exact Spark algorithm that avoids full-da...","url_abs":"https://arxiv.org/abs/2511.12025","url_pdf":"https://arxiv.org/pdf/2511.12025v2","authors":"[\"Ivan Cao\",\"Jaromir J. Saloni\",\"David A. G. Harrison\"]","published":"2025-11-15T04:26:11Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[]","has_code":false}
