{"ID":2850130,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.22816","arxiv_id":"2510.22816","title":"$L_p$ Sampling in Distributed Data Streams with Applications to Adversarial Robustness","abstract":"In the distributed monitoring model, a data stream over a universe of size $n$ is distributed over $k$ servers, who must continuously provide certain statistics of the overall dataset, while minimizing communication with a central coordinator. In such settings, the ability to efficiently collect a random sample from the global stream is a powerful primitive, enabling a wide array of downstream tasks such as estimating frequency moments, detecting heavy hitters, or performing sparse recovery. Of particular interest is the task of producing a perfect $L_p$ sample, which given a frequency vector $f \\in \\mathbb{R}^n$, outputs an index $i$ with probability $\\frac{f_i^p}{\\|f\\|_p^p}+\\frac{1}{\\mathrm{poly}(n)}$. In this paper, we resolve the problem of perfect $L_p$ sampling for all $p\\ge 1$ in the distributed monitoring model. Specifically, our algorithm runs in $k^{p-1} \\cdot \\mathrm{polylog}(n)$ bits of communication, which is optimal up to polylogarithmic factors. Utilizing our perfect $L_p$ sampler, we achieve adversarially-robust distributed monitoring protocols for the $F_p$ moment estimation problem, where the goal is to provide a $(1+\\varepsilon)$-approximation to $f_1^p+\\ldots+f_n^p$. Our algorithm uses $\\frac{k^{p-1}}{\\varepsilon^2}\\cdot\\mathrm{polylog}(n)$ bits of communication for all $p\\ge 2$ and achieves optimal bounds up to polylogarithmic factors, matching lower bounds by Woodruff and Zhang (STOC 2012) in the non-robust setting. Finally, we apply our framework to achieve near-optimal adversarially robust distributed protocols for central problems such as counting, frequency estimation, heavy-hitters, and distinct element estimation.","short_abstract":"In the distributed monitoring model, a data stream over a universe of size $n$ is distributed over $k$ servers, who must continuously provide certain statistics of the overall dataset, while minimizing communication with a central coordinator. In such settings, the ability to efficiently collect a random sample from th...","url_abs":"https://arxiv.org/abs/2510.22816","url_pdf":"https://arxiv.org/pdf/2510.22816v1","authors":"[\"Honghao Lin\",\"Zhao Song\",\"David P. Woodruff\",\"Shenghao Xie\",\"Samson Zhou\"]","published":"2025-10-26T20:06:16Z","proceeding":"cs.DS","tasks":"[\"cs.DS\"]","methods":"[]","has_code":false}
