From Administrative Chaos to Analytical Cohorts: A Three-Stage Normalisation Pipeline for Longitudinal University Administrative Records
Abstract
Longitudinal university administrative records are increasingly used in data-driven decision-making, yet a critical layer is often overlooked: how raw, inconsistent data are normalised before modelling. This article presents a three-stage normalisation pipeline for a dataset of 24,133 engineering students at a Latin American public university, spanning four decades (1980-2019). The pipeline comprises: (i) N1 CENSAL, which harmonises demographics into a single person-level layer; (ii) N1b IDENTITY RESOLUTION, which consolidates duplicate identifiers into a canonical ID while preserving an audit trail; and (iii) N1c GEO and SECONDARY-SCHOOL NORMALISATION, which builds reference tables, classifies school types (state national, state provincial, private secular, private religious), and flags irrecoverable cases as DATA_MISSING. The pipeline preserves 100% of students, achieves full geocoding, and yields valid school types for 56.6% of the population. The remaining 43.4% are identified as structurally missing: the gaps stem from legacy enrolment practices rather than stochastic non-response. Forensic analysis (chi-square tests and logistic regression) shows that missingness is highly predictable from entry decade and geography, confirming a structural, historically induced mechanism. The article contributes: (a) a transparent, reproducible normalisation pipeline tailored to higher education; (b) a framework for treating structurally missing information without speculative imputation; and (c) guidance on defining analytically coherent cohorts (the full population versus secondary-school-informed subcohorts) for downstream learning analytics and policy evaluation.
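To make the identity-resolution stage (N1b) concrete, the following minimal Python sketch shows one way duplicate identifiers could be consolidated into a canonical ID while keeping an audit trail of the raw IDs behind each canonical one. It is an illustration under stated assumptions, not the article's implementation: the function name `resolve_identities`, the input shapes, and the rule of keeping the lexicographically smallest identifier as canonical are all hypothetical.

```python
from collections import defaultdict


def resolve_identities(student_ids, duplicate_pairs):
    """Map each raw ID to a canonical ID and return an audit trail.

    Assumptions (hypothetical, for illustration only):
    - student_ids is an iterable of unique raw identifiers (strings);
    - duplicate_pairs lists pairs of raw IDs known to be the same person,
      and every ID in a pair also appears in student_ids;
    - the canonical ID is the lexicographically smallest ID in each group.
    """
    # Union-find structure: every ID starts as its own group root.
    parent = {sid: sid for sid in student_ids}

    def find(x):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller root so the canonical ID is deterministic.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    canonical = {sid: find(sid) for sid in parent}

    # Audit trail: canonical ID -> sorted list of all raw IDs merged into it.
    audit = defaultdict(list)
    for raw, canon in canonical.items():
        audit[canon].append(raw)
    return canonical, {canon: sorted(raws) for canon, raws in audit.items()}


# Example with made-up identifiers: A1, A2 and C3 refer to the same student.
canonical, audit = resolve_identities(
    ["A1", "A2", "B7", "C3"],
    [("A2", "A1"), ("C3", "A2")],
)
```

Because no raw identifier is discarded, the mapping is reversible: every canonical ID can be traced back to the original records, which is what allows a pipeline of this kind to retain the full population while removing duplicates.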