Methodological Precedence in Health Tech: Why ML/Big Data Analysis Must Follow Basic Epidemiological Consistency. A Case Study

Nov 10, 2025 cs.LG arXiv:2511.07500

Abstract

The integration of advanced analytical tools, including Machine Learning (ML) and massive data processing, has revolutionized health research, promising unprecedented accuracy in diagnosis and risk prediction. However, the rigor of these methods depends fundamentally on the quality and integrity of the underlying datasets and on the validity of the statistical design. We present an emblematic case showing that advanced analysis (ML/Big Data) must come only after basic methodological coherence and adherence to established reporting guidelines, such as the STROBE Statement, have been verified. This study highlights a crucial cautionary principle: sophisticated analyses amplify, rather than correct, severe methodological flaws rooted in basic design choices, leading to misleading or contradictory findings. By applying simple, standard descriptive statistics and established national epidemiological benchmarks to a recently published cohort study on COVID-19 vaccine outcomes and severe adverse events such as cancer, we expose multiple statistically irreconcilable paradoxes. In particular, the study reports an increased cancer incidence within an exposure subgroup while its overall Crude Incidence Rate is suppressed relative to national standards; this contradiction invalidates the reported increase in cancer risk for the total population. We demonstrate that the observed effects are mathematical artifacts stemming from uncorrected selection bias in the cohort construction. This analysis is a reminder that even the most complex health studies must first pass the test of basic epidemiological consistency before any conclusion drawn from subsequent advanced statistical modeling can be considered valid or publishable.
