{"ID":2831099,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.08376","arxiv_id":"2512.08376","title":"A Distribution Testing Approach to Clustering Distributions","abstract":"We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are the same, and the two distributions associated with the two clusters are $\\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when one of the cluster's distributions is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, number of distributions $k$, size $r$ of one of the clusters, and distance $\\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\\varepsilon)$ (up to an $O(\\log k)$ factor) for all regimes.","short_abstract":"We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are the same, and the two distributions associated with the two clusters are $\\varepsilon$-far in total variation, the goal is to recover the partition. We...","url_abs":"https://arxiv.org/abs/2512.08376","url_pdf":"https://arxiv.org/pdf/2512.08376v2","authors":"[\"Gunjan Kumar\",\"Yash Pote\",\"Jonathan Scarlett\"]","published":"2025-12-09T09:01:41Z","proceeding":"cs.DS","tasks":"[\"cs.DS\",\"cs.IT\",\"math.ST\",\"stat.ML\"]","methods":"[]","has_code":false}
