{"ID":2825352,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.20924","arxiv_id":"2512.20924","title":"Clever Hans in Chemistry: Chemist Style Signals Confound Activity Prediction on Public Benchmarks","abstract":"Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure-activity relationships. We test this by linking CHEMBL assays to publication authors and training a 1,815-class classifier to predict authors from molecular fingerprints, achieving 60% top-5 accuracy under scaffold-based splitting. We then train an activity model that receives only a protein identifier and an author-probability vector derived from structure, with no direct access to molecular descriptors. This author-only model achieves predictive power comparable to a simple baseline that has access to structure. This reveals a \"Clever Hans\" failure mode: models can predict bioactivity largely by inferring chemist goals and favorite targets without requiring a lab-independent understanding of chemistry. We analyze the sources of this leakage, propose author-disjoint splits, and recommend dataset practices to decouple chemist intent from biological outcomes.","short_abstract":"Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure-activity relationships. We test this by linking CHEMBL assays to publication authors and training a 1,815-class classifier to...","url_abs":"https://arxiv.org/abs/2512.20924","url_pdf":"https://arxiv.org/pdf/2512.20924v1","authors":"[\"Andrew D. Blevins\",\"Ian K. Quigley\"]","published":"2025-12-24T04:04:20Z","proceeding":"q-bio.BM","tasks":"[\"q-bio.BM\",\"cs.LG\",\"physics.chem-ph\"]","methods":"[]","has_code":false}
