{"ID":2861896,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00552","arxiv_id":"2510.00552","title":"Data Quality Challenges in Retrieval-Augmented Generation","abstract":"Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks have been primarily developed for static datasets, and only inadequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based systems. We conduct 16 semi-structured interviews with practitioners of leading IT service companies. Through a qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt \u0026 search, and generation. Our findings reveal that (1) new dimensions have to be added to traditional DQ frameworks to also cover RAG contexts; (2) these new dimensions are concentrated in early RAG steps, suggesting the need for front-loaded quality management strategies, and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.","short_abstract":"Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks have been primarily developed for static datasets, and only inadequately address the dynamic, multi-stage nature of RAG systems. \nThis s...","url_abs":"https://arxiv.org/abs/2510.00552","url_pdf":"https://arxiv.org/pdf/2510.00552v1","authors":"[\"Leopold Müller\",\"Joshua Holstein\",\"Sarah Bause\",\"Gerhard Satzger\",\"Niklas Kühl\"]","published":"2025-10-01T06:13:40Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.HC\"]","methods":"[\"RAG\",\"Language Model\",\"Generative Adversarial Network\"]","has_code":false}
