{"ID":2839986,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.14365","arxiv_id":"2511.14365","title":"The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models","abstract":"The application of large language models (LLMs) to chemistry is frequently hampered by a \"tokenization bottleneck\", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically uninformative sub-tokens. This paper introduces a principled methodology to resolve this bottleneck by unifying the representation of natural language and molecular structures within a single model. Our approach involves targeted vocabulary extension (augmenting a pretrained LLM's vocabulary with chemically salient tokens), followed by continued pretraining on chemistry-domain text to integrate this new knowledge. We provide an empirical demonstration of the effectiveness of this strategy, showing that our methodology leads to superior performance on a range of downstream chemical tasks.","short_abstract":"The application of large language models (LLMs) to chemistry is frequently hampered by a \"tokenization bottleneck\", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically uninformative sub-tokens. This paper introduces a principled methodology to resolve...","url_abs":"https://arxiv.org/abs/2511.14365","url_pdf":"https://arxiv.org/pdf/2511.14365v1","authors":"[\"Prathamesh Kalamkar\",\"Ned Letcher\",\"Meissane Chami\",\"Sahger Lad\",\"Shayan Mohanty\",\"Prasanna Pendse\"]","published":"2025-11-18T11:12:35Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
