{"ID":2824364,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.23491","arxiv_id":"2512.23491","title":"SPER: Accelerating Progressive Entity Resolution via Stochastic Bipartite Maximization","abstract":"Entity Resolution (ER) is a critical data cleaning task for identifying records that refer to the same real-world entity. In the era of Big Data, traditional batch ER is often infeasible due to volume and velocity constraints, necessitating Progressive ER methods that maximize recall within a limited computational budget. However, existing progressive approaches fail to scale to high-velocity streams because they rely on deterministic sorting to prioritize candidate pairs, a process that incurs prohibitive super-linear complexity and heavy initialization costs. To address this scalability wall, we introduce SPER (Stochastic Progressive ER), a novel framework that redefines prioritization as a sampling problem rather than a ranking problem. By replacing global sorting with a continuous stochastic bipartite maximization strategy, SPER acts as a probabilistic high-pass filter that selects high-utility pairs in strictly linear time. Extensive experiments on eight real-world datasets demonstrate that SPER achieves significant speedups (3x to \u003e6x) over state-of-the-art baselines while maintaining comparable recall and precision.","short_abstract":"Entity Resolution (ER) is a critical data cleaning task for identifying records that refer to the same real-world entity. In the era of Big Data, traditional batch ER is often infeasible due to volume and velocity constraints, necessitating Progressive ER methods that maximize recall within a limited computational budg...","url_abs":"https://arxiv.org/abs/2512.23491","url_pdf":"https://arxiv.org/pdf/2512.23491v2","authors":"[\"Dimitrios Karapiperis\",\"George Papadakis\",\"Vassilios Verykios\"]","published":"2025-12-29T14:26:05Z","proceeding":"cs.DB","tasks":"[\"cs.DB\"]","methods":"[]","has_code":false}
