{"ID":2830056,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.11912","arxiv_id":"2512.11912","title":"Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis","abstract":"A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient (for GPT-2, test NLL increases modestly from 2.87 to 3.59 despite 50% token corruption). By contrast, under the same levels of data corruption, class-conditional diffusion models degrade catastrophically (image-label consistency plummets by 56.81% relative to baseline), while classifiers show a moderate impact that diminishes with dataset scale. To explain these discrepancies, we analyze the results through a multi-perspective lens, integrating information theory, PAC learning, and gradient dynamics. These analyses suggest that robustness is heavily influenced by two key principles: the richness of conditioning information, which constrains the learning problem, and the absolute information content of the training data, which allows the signal from correct information to dominate statistical noise.","short_abstract":"A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient (for GPT-2, test NLL increases modestly from...","url_abs":"https://arxiv.org/abs/2512.11912","url_pdf":"https://arxiv.org/pdf/2512.11912v1","authors":"[\"Liu Peng\",\"Yaochu Jin\"]","published":"2025-12-11T02:10:41Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Language Model\"]","has_code":false}
