{"ID":2856228,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.11188","arxiv_id":"2510.11188","title":"Protein as a Second Language for LLMs","abstract":"Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the \"Protein-as-Second-Language\" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence-question-answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement (average +7%) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.","short_abstract":"Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. \nWe introduce the \"Protein-as-Second-Language\" framework, which reformulates amino-acid sequences as sentences in...","url_abs":"https://arxiv.org/abs/2510.11188","url_pdf":"https://arxiv.org/pdf/2510.11188v1","authors":"[\"Xinhui Chen\",\"Zuchao Li\",\"Mengqi Gao\",\"Yufeng Zhang\",\"Chak Tou Leong\",\"Haoyang Li\",\"Jiaqi Chen\"]","published":"2025-10-13T09:21:45Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"q-bio.BM\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
