{"ID":2876371,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.00293","arxiv_id":"2509.00293","title":"Illuminating Patterns of Divergence: DataDios SmartDiff for Large-Scale Data Difference Analysis","abstract":"Data engineering workflows require reliable differencing across files, databases, and query outputs, yet existing tools falter under schema drift, heterogeneous types, and limited explainability. SmartDiff is a unified system that combines schema-aware mapping, type-specific comparators, and parallel execution. It aligns evolving schemas, compares structured and semi-structured data (strings, numbers, dates, JSON/XML), and clusters results with labels that explain how and why differences occur. On multi-million-row datasets, SmartDiff achieves over 95 percent precision and recall, runs 30 to 40 percent faster, and uses 30 to 50 percent less memory than baselines; in user studies, it reduces root-cause analysis time from 10 hours to 12 minutes. An LLM-assisted labeling pipeline produces deterministic, schema-valid multilabel explanations using retrieval augmentation and constrained decoding; ablations show further gains in label accuracy and time to diagnosis over rules-only baselines. These results indicate SmartDiff's utility for migration validation, regression testing, compliance auditing, and continuous data quality monitoring. Index Terms: data differencing, schema evolution, data quality, parallel processing, clustering, explainable validation, big data","short_abstract":"Data engineering workflows require reliable differencing across files, databases, and query outputs, yet existing tools falter under schema drift, heterogeneous types, and limited explainability. SmartDiff is a unified system that combines schema-aware mapping, type-specific comparators, and parallel execution. It alig...","url_abs":"https://arxiv.org/abs/2509.00293","url_pdf":"https://arxiv.org/pdf/2509.00293v1","authors":"[\"Aryan Poduri\",\"Yashwant Tailor\"]","published":"2025-08-30T01:03:44Z","proceeding":"cs.DB","tasks":"[\"cs.DB\",\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}
