{"ID":2892840,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.14639","arxiv_id":"2507.14639","title":"KinForm: Kinetics Informed Feature Optimised Representation Models for Enzyme $k_{cat}$ and $K_{M}$ Prediction","abstract":"Kinetic parameters such as the turnover number ($k_{cat}$) and Michaelis constant ($K_{\\mathrm{M}}$) are essential for modelling enzymatic activity but experimental data remains limited in scale and diversity. Previous methods for predicting enzyme kinetics typically use mean-pooled residue embeddings from a single protein language model to represent the protein. We present KinForm, a machine learning framework designed to improve predictive accuracy and generalisation for kinetic parameters by optimising protein feature representations. KinForm combines several residue-level embeddings (Evolutionary Scale Modeling Cambrian, Evolutionary Scale Modeling 2, and ProtT5-XL-UniRef50), taken from empirically selected intermediate transformer layers, and applies weighted pooling based on per-residue binding-site probability. To counter the resulting high dimensionality, we apply dimensionality reduction using principal component analysis (PCA) on concatenated protein features, and rebalance the training data via a similarity-based oversampling strategy. KinForm outperforms baseline methods on two benchmark datasets. Improvements are most pronounced in low sequence similarity bins. We observe improvements from binding-site probability pooling, intermediate-layer selection, PCA, and oversampling of low-identity proteins. We also find that removing sequence overlap between folds provides a more realistic evaluation of generalisation and should be the standard over random splitting when benchmarking kinetic prediction models.","short_abstract":"Kinetic parameters such as the turnover number ($k_{cat}$) and Michaelis constant ($K_{\\mathrm{M}}$) are essential for modelling enzymatic activity but experimental data remains limited in scale and diversity. Previous methods for predicting enzyme kinetics typically use mean-pooled residue embeddings from a single pro...","url_abs":"https://arxiv.org/abs/2507.14639","url_pdf":"https://arxiv.org/pdf/2507.14639v1","authors":"[\"Saleh Alwer\",\"Ronan Fleming\"]","published":"2025-07-19T14:34:57Z","proceeding":"q-bio.QM","tasks":"[\"q-bio.QM\",\"cs.LG\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
