{"ID":2834107,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.02936","arxiv_id":"2512.02936","title":"From Administrative Chaos to Analytical Cohorts: A Three-Stage Normalisation Pipeline for Longitudinal University Administrative Records","abstract":"The growing use of longitudinal university administrative records in data-driven decision-making often overlooks a critical layer: how raw, inconsistent data are normalised before modelling. This article presents a three-stage normalisation pipeline for a dataset of 24,133 engineering students at a Latin American public university, spanning four decades (1980-2019). The pipeline comprises: (i) N1 CENSAL, harmonising demographics into a single person-level layer; (ii) N1b IDENTITY RESOLUTION, consolidating duplicate identifiers into a canonical ID while preserving an audit trail; and (iii) N1c GEO and SECONDARY-SCHOOL NORMALISATION, which builds reference tables, classifies school types (state national, state provincial, private secular, private religious), and flags irrecoverable cases as DATA_MISSING. The pipeline preserves 100% of students, achieves full geocoding, and yields valid school types for 56.6% of the population. The remaining 43.4% are identified as structurally missing due to legacy enrolment practices rather than stochastic non-response. Forensic analysis (chi-square, logistic regression) shows missingness is highly predictable from entry decade and geography, confirming a structural, historically induced mechanism. The article contributes: (a) a transparent, reproducible normalisation pipeline tailored to higher education; (b) a framework for treating structurally missing information without speculative imputation; and (c) guidance on defining analytically coherent cohorts (full population vs. secondary-school-informed subcohorts) for downstream learning analytics and policy evaluation.","short_abstract":"The growing use of longitudinal university administrative records in data-driven decision-making often overlooks a critical layer: how raw, inconsistent data are normalised before modelling. This article presents a three-stage normalisation pipeline for a dataset of 24,133 engineering students at a Latin American publi...","url_abs":"https://arxiv.org/abs/2512.02936","url_pdf":"https://arxiv.org/pdf/2512.02936v1","authors":"[\"H. R. Paz\"]","published":"2025-12-02T17:02:08Z","proceeding":"cs.DB","tasks":"[\"cs.DB\"]","methods":"[]","has_code":false}
