{"ID":2882867,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09894","arxiv_id":"2508.09894","title":"Rare anomalies require large datasets: About proving the existence of anomalies","abstract":"Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant $α_{\\text{algo}}$. Our results demonstrate that, for an unlabeled dataset of size $N$ and contamination rate $ν$, the condition $N \\ge \\frac{α_{\\text{algo}}}{ν^2}$ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be before proving their existence becomes infeasible.","short_abstract":"Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present?...","url_abs":"https://arxiv.org/abs/2508.09894","url_pdf":"https://arxiv.org/pdf/2508.09894v1","authors":"[\"Simon Klüttermann\",\"Emmanuel Müller\"]","published":"2025-08-13T15:52:33Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[]","has_code":false}
