{"ID":2823341,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.00995","arxiv_id":"2601.00995","title":"Grain Theory: Type-Level Granularity Correctness in Data Pipelines","abstract":"Data transformation correctness is a fundamental challenge in data engineering: how can we verify that pipelines produce correct results before executing on production data? Existing practice relies on iterative testing over materialized data. A common cause of errors is the absence of formal reasoning about grain -- the level of detail of data -- so transformations inadvertently change granularity, yielding pathologies like fan traps (metric duplication) and chasm traps (data loss). We introduce grain theory, a type-theoretic framework that elevates grain to a composable property of any algebraic data type. It has two phases. First, a denotation of data: grain itself, defined by irreducibility and isomorphism, with no reference to functional dependencies; three grain relations forming a bounded lattice whose axioms recover Armstrong's on product types; the entity key as a derived grain; and grain-determined behavioral classes -- together the type-level triple (G[R], EK[R], BC[R]). Second, a denotation of transformations: every transformation $h$ has a grain lift $\\varphi(h)$. For collections of product types under the relational algebra we prove an equi-join grain inference theorem and present CalcG, a decidable algorithm that composes grain lifts across a pipeline DAG. The central theorem -- the grain homomorphism -- ties the phases together: grain projection commutes with transformation, and grain lifts compose ($\\varphi(h_2 \\circ h_1) = \\varphi(h_2) \\circ \\varphi(h_1)$). Grain-correctness is therefore verifiable at design time, before any code or query runs. As corollaries, fan traps emerge as schema-detectable grain-relation violations; chasm traps localize to a specific ordering-chain pattern; and behavioral-class violations, such as point-in-time queries on the wrong collection type, become compile-time type errors. All theorems are mechanically verified in Lean 4.","short_abstract":"Data transformation correctness is a fundamental challenge in data engineering: how can we verify that pipelines produce correct results before executing on production data? Existing practice relies on iterative testing over materialized data. A common cause of errors is the absence of formal reasoning about grain -- t...","url_abs":"https://arxiv.org/abs/2601.00995","url_pdf":"https://arxiv.org/pdf/2601.00995v2","authors":"[\"Nikos Karayannidis\"]","published":"2026-01-02T22:26:31Z","proceeding":"cs.DB","tasks":"[\"cs.DB\"]","methods":"[]","has_code":false}
