{"ID":2867405,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.19586","arxiv_id":"2509.19586","title":"A Foundation Chemical Language Model for Comprehensive Fragment-Based Drug Discovery","abstract":"We introduce FragAtlas-62M, a specialized foundation model trained on the largest fragment dataset to date. Built on the complete ZINC-22 fragment subset comprising over 62 million molecules, it achieves unprecedented coverage of fragment chemical space. Our GPT-2 based model (42.7M parameters) generates 99.90% chemically valid fragments. Validation across 12 descriptors and three fingerprint methods shows generated fragments closely match the training distribution (all effect sizes \u003c 0.4). The model retains 53.6% of known ZINC fragments while producing 22% novel structures with practical relevance. We release FragAtlas-62M with training code, preprocessed data, documentation, and model weights to accelerate adoption.","short_abstract":"We introduce FragAtlas-62M, a specialized foundation model trained on the largest fragment dataset to date. Built on the complete ZINC-22 fragment subset comprising over 62 million molecules, it achieves unprecedented coverage of fragment chemical space. Our GPT-2 based model (42.7M parameters) generates 99.90% chemica...","url_abs":"https://arxiv.org/abs/2509.19586","url_pdf":"https://arxiv.org/pdf/2509.19586v1","authors":"[\"Alexander Ho\",\"Sukyeong Lee\",\"Francis T. F. Tsai\"]","published":"2025-09-23T21:23:36Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"q-bio.BM\"]","methods":"[\"Language Model\"]","has_code":false}
