{"ID":2884045,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.07179","arxiv_id":"2508.07179","title":"Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks","abstract":"Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This \"semantic drift\" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from small language models (SLMs) of 1.3B to 32B parameters to large language models (LLMs) such as GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.","short_abstract":"Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This \"semantic drift\" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented...","url_abs":"https://arxiv.org/abs/2508.07179","url_pdf":"https://arxiv.org/pdf/2508.07179v1","authors":"[\"Jiaqi Yin\",\"Yi-Wei Chen\",\"Meng-Lung Lee\",\"Xiya Liu\"]","published":"2025-08-10T05:04:32Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.DB\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\"]","has_code":false}
