{"ID":2890757,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18101","arxiv_id":"2507.18101","title":"Large-scale entity resolution via microclustering Ewens--Pitman random partitions","abstract":"We introduce the microclustering Ewens--Pitman model for random partitions, obtained by scaling the strength parameter of the Ewens--Pitman model linearly with the sample size. The resulting random partition is shown to have the microclustering property, namely: the size of the largest cluster grows sub-linearly with the sample size, while the number of clusters grows linearly. By leveraging the interplay between the Ewens--Pitman random partition and the Pitman--Yor process, we develop efficient variational inference schemes for posterior computation in entity resolution. Our approach achieves a speed-up of three orders of magnitude over existing Bayesian methods for entity resolution, while maintaining competitive empirical performance.","short_abstract":"We introduce the microclustering Ewens--Pitman model for random partitions, obtained by scaling the strength parameter of the Ewens--Pitman model linearly with the sample size. The resulting random partition is shown to have the microclustering property, namely: the size of the largest cluster grows sub-linearly with t...","url_abs":"https://arxiv.org/abs/2507.18101","url_pdf":"https://arxiv.org/pdf/2507.18101v1","authors":"[\"Mario Beraha\",\"Stefano Favaro\"]","published":"2025-07-24T05:28:40Z","proceeding":"stat.ME","tasks":"[\"stat.ME\",\"math.ST\",\"stat.CO\",\"stat.ML\"]","methods":"[]","has_code":false}
