Data as a Lever: A Neighbouring Datasets Perspective on Predictive Multiplicity
Abstract
Multiplicity, the existence of equally good yet competing models, has received growing attention in recent years. While prior work has emphasized modelling choices, the critical role of data in shaping multiplicity has been largely overlooked. In this work, we first introduce a neighbouring datasets framework, arguing that much of data processing can be reframed as choosing between neighbouring datasets. Under this framework, we find a counterintuitive theoretical relationship: neighbouring datasets with greater inter-class distribution overlap exhibit lower multiplicity. Building on this insight, we apply our framework to two domains: active learning and data imputation. For each, we establish natural extensions of the neighbouring datasets perspective, conduct the first systematic study of multiplicity in existing algorithms, and finally, propose novel multiplicity-aware methods, namely, multiplicity-aware data acquisition strategies for active learning and multiplicity-aware data imputation.