{"ID":2888725,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.22514","arxiv_id":"2507.22514","title":"SmilesT5: Domain-specific pretraining for molecular language models","abstract":"Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks can learn molecular properties using graph-based, language-based or feature-based methods. Recent advances in natural language processing have highlighted the capabilities of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance in six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show that data and computational efficiency can be improved by using these domain-specific pretraining tasks. Finally, the pretrained embeddings from the model can be used as fixed inputs into a downstream machine learning classifier and yield comparable performance to finetuning but with much lower computational overhead.","short_abstract":"Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks can learn molecular properties using graph-based, language-based or feature-based methods. Recent advances in natural language processing have highlighted the capabilities of neural networks...","url_abs":"https://arxiv.org/abs/2507.22514","url_pdf":"https://arxiv.org/pdf/2507.22514v1","authors":"[\"Philip Spence\",\"Brooks Paige\",\"Anne Osbourn\"]","published":"2025-07-30T09:36:18Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
