{"ID":2892685,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.15112","arxiv_id":"2507.15112","title":"Distributional Machine Unlearning via Selective Data Removal","abstract":"Machine learning systems increasingly face requirements to remove entire domains of information--such as toxic language or biases--rather than individual user data. This task presents a dilemma: full removal of the unwanted domain data is computationally expensive, while random partial removal is statistically inefficient. We find that a domain's statistical influence is often concentrated in a small subset of its data samples, suggesting a path between ineffective partial removal and unnecessary complete removal. We formalize this as distributional unlearning: a framework to select a small subset that balances forgetting an unwanted distribution while preserving a desired one. Using Kullback-Leibler divergence constraints, we derive the exact removal-preservation Pareto frontier for Gaussian distributions and prove that models trained on the edited data achieve corresponding log-loss bounds. We propose a distance-based selection algorithm and show it is quadratically more sample-efficient than random removal in the challenging low-divergence regime. Experiments across synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) show our method requires 15-82% less deletion than full removal for strong unlearning effects, e.g., halving initial forget set accuracy. Ultimately, by showing a small forget set often suffices, our framework lays the foundations for more scalable and rigorous subpopulation unlearning.","short_abstract":"Machine learning systems increasingly face requirements to remove entire domains of information--such as toxic language or biases--rather than individual user data. This task presents a dilemma: full removal of the unwanted domain data is computationally expensive, while random partial removal is statistically ineffici...","url_abs":"https://arxiv.org/abs/2507.15112","url_pdf":"https://arxiv.org/pdf/2507.15112v4","authors":"[\"Youssef Allouah\",\"Rachid Guerraoui\",\"Sanmi Koyejo\"]","published":"2025-07-20T20:21:23Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CR\",\"stat.ML\"]","methods":"[]","has_code":false}