{"ID":2885462,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.05901","arxiv_id":"2508.05901","title":"Estimating the size of a set using cascading exclusion","abstract":"Let $S$ be a finite set, and $X_1,\\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^\\frac{1}{2}$. On the other hand, if $S=\\{1,2,\\ldots,|S|\\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.","short_abstract":"Let $S$ be a finite set, and $X_1,\\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^\\frac{1}{2}$. On the other hand, if $S=\\{1,2,\\ldots,|S|\\}$, the maximum of the samp...","url_abs":"https://arxiv.org/abs/2508.05901","url_pdf":"https://arxiv.org/pdf/2508.05901v4","authors":"[\"Sourav Chatterjee\",\"Persi Diaconis\",\"Susan Holmes\"]","published":"2025-08-07T23:36:42Z","proceeding":"math.ST","tasks":"[\"math.ST\",\"cs.IT\",\"math.PR\",\"stat.ML\"]","methods":"[]","has_code":false}
