{"ID":3085034,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-05T10:38:01.117085634Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05225","arxiv_id":"2606.05225","title":"The Language of Elution: Autoregressive Prediction of the Next Feature in Untargeted LC-HRMS Lipidomics","abstract":"Untargeted liquid chromatography-high-resolution mass spectrometry (LC-HRMS) detects thousands of molecular features per sample, yet only 2-20% receive confident structural annotations. A root cause of this \"dark metabolome\" is that tandem MS/MS acquisition is reactive: instruments select precursors only after ions appear, blind to what elutes next. We reframe chromatographic elution as an autoregressive sequence prediction task. Because reversed-phase elution order is governed by hydrophobicity, successive features form a physically constrained sequence, like tokens in language. We discretize the mass-to-charge (m/z) axis into 110 bins and train long short-term memory (LSTM) and Transformer models to predict the next eluting m/z bin from five annotation-free per-token features: m/z bin, mass defect, retention-time gap, polarity, and intensity rank. Trained on 15,242 features from four clinical lipidomics cohorts (342 plasma samples; SCIEX TripleTOF 6600+, Waters CSH C18), the LSTM reaches 98.4% top-1 accuracy (99.99% top-5; mean absolute error 3.6 Da) and the Transformer 98.0%. Ablation shows autoregressive context accounts for 55.5 percentage points while no single feature contributes more than 0.2 pp: the sequential pattern, not molecular properties, drives prediction. Models transfer across instruments sharing the method (r=0.999 on an independent Agilent 6530 dataset) but fail under a different column chemistry (5.1% top-1) or polarity mode (2.6%), confirming method- and mode-specificity. Fine-tuning on as few as two to five quality-control injections recovers held-out accuracy from 2.6% to nearly 50%, so cross-condition deployment needs minimal calibration. These results establish that elution sequences are highly predictable and lay the groundwork for predictive MS/MS acquisition to improve annotation coverage in untargeted metabolomics.","short_abstract":"Untargeted liquid chromatography-high-resolution mass spectrometry (LC-HRMS) detects thousands of molecular features per sample, yet only 2-20% receive confident structural annotations. A root cause of this \"dark metabolome\" is that tandem MS/MS acquisition is reactive: instruments select precursors only after ions app...","url_abs":"https://arxiv.org/abs/2606.05225","url_pdf":"https://arxiv.org/pdf/2606.05225v1","authors":"[\"Dayanjan S. Wijesinghe\"]","published":"2026-06-02T10:42:17Z","proceeding":"q-bio.QM","tasks":"[\"q-bio.QM\",\"cs.LG\"]","methods":"[\"Transformer\"]","has_code":false}
