{"ID":2862810,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.26291","arxiv_id":"2509.26291","title":"Representation-Based Data Quality Audits for Audio","abstract":"Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image to the audio domain. This approach leverages self-supervised audio representations to identify common data quality issues, creating ranked review lists that surface distinct issues within a single, unified process. The method is benchmarked on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review.","short_abstract":"Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image to the audio domain. This approach leverages self-supervised audi...","url_abs":"https://arxiv.org/abs/2509.26291","url_pdf":"https://arxiv.org/pdf/2509.26291v1","authors":"[\"Alvaro Gonzalez-Jimenez\",\"Fabian Gröger\",\"Linda Wermelinger\",\"Andrin Bürli\",\"Iason Kastanis\",\"Simone Lionetti\",\"Marc Pouly\"]","published":"2025-09-30T14:08:03Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.LG\"]","methods":"[]","has_code":false}