{"ID":2849532,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.23127","arxiv_id":"2510.23127","title":"Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs","abstract":"Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language, risking the loss of functional motif information, or as a separate modality, introducing formidable alignment challenges, current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, the inclusion of the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge.\nThis work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at https://github.com/opendatalab-raiser/CoKE.","short_abstract":"Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language, risking the loss of funct...","url_abs":"https://arxiv.org/abs/2510.23127","url_pdf":"https://arxiv.org/pdf/2510.23127v2","authors":"[\"Kai Zhuang\",\"Jiawei Zhang\",\"Yumou Liu\",\"Hanqun Cao\",\"Chunbin Gu\",\"Mengdi Liu\",\"Zhangyang Gao\",\"Zitong Jerry Wang\",\"Xuanhe Zhou\",\"Pheng-Ann Heng\",\"Lijun Wu\",\"Conghui He\",\"Cheng Tan\"]","published":"2025-10-27T09:03:21Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607716,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2849532,"paper_url":"https://arxiv.org/abs/2510.23127","paper_title":"Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs","repo_url":"https://github.com/opendatalab-raiser/CoKE","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
