{"ID":2872227,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.10572","arxiv_id":"2509.10572","title":"Quality Assessment of Tabular Data using Large Language Models and Code Generation","abstract":"Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often struggles with inefficiency, human intervention, and high computational costs. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and synthesize their executable validators through code-generating LLMs. To generate reliable quality rules, we aid LLMs with retrieval-augmented generation (RAG) by leveraging external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both rules and code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.","short_abstract":"Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often struggles with inefficiency, human intervention, and high computational costs. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filt...","url_abs":"https://arxiv.org/abs/2509.10572","url_pdf":"https://arxiv.org/pdf/2509.10572v2","authors":"[\"Ashlesha Akella\",\"Akshar Kaul\",\"Krishnasuri Narayanam\",\"Sameep Mehta\"]","published":"2025-09-11T14:17:42Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\",\"cs.DB\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\"]","has_code":false}
