{"ID":2900883,"CreatedAt":"2026-06-01T05:51:17.9442275Z","UpdatedAt":"2026-06-01T06:23:29.641557848Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2605.30963","arxiv_id":"2605.30963","title":"AMix-2: Establishing Protein as a Native Modality in Large Language Models","abstract":"We present AMix-2, a protein-text foundation model that establishes protein as a native modality in large language models (LLMs), unifying protein understanding and sequence design within a single foundation model. AMix-2 is built upon two key ideas: (1) a unified protein-text formulation that embeds natural language and protein sequence in a shared token space, enabling one model to perform biological reasoning and conditional design instead of separate downstream task-specialized models; and (2) a block-wise diffusion language modeling backbone that combines causal generation across blocks with bidirectional context and iterative refinement within blocks. This scheme better matches the intrinsic nature of proteins than a strict left-to-right factorization. To evaluate protein foundation models under realistic generalization settings, we further introduce ProteinArena, a comprehensive benchmark with time-aware and homology-aware protocols across various understanding and design tasks, and with baselines covering classical bioinformatics tools, protein-specialized models and LLMs. On ProteinArena, AMix-2 outperforms frontier LLMs and demonstrates competitive performance to task-specific protein models. Controlled experiments further show that the diffusion-based paradigm generally surpasses its autoregressive counterpart, highlighting the advantage of flexible generation order for protein sequences. We release both AMix-2 and ProteinArena to facilitate open research in protein foundation models.","short_abstract":"We present AMix-2, a protein-text foundation model that establishes protein as a native modality in large language models (LLMs), unifying protein understanding and sequence design within a single foundation model. AMix-2 is built upon two key ideas: (1) a unified protein-text formulation that embeds natural language a...","url_abs":"https://arxiv.org/abs/2605.30963","url_pdf":"https://arxiv.org/pdf/2605.30963v1","authors":"[\"Keyue Qiu\",\"Yixin Wu\",\"Lihao Wang\",\"Yawen Ouyang\",\"Jixiang Yu\",\"Zihan Zhou\",\"Changze Lv\",\"Dongyu Xue\",\"Yuxuan Song\",\"Xinbo Zhang\",\"Hao Wang\",\"Jiangtao Feng\",\"Zhiqiang Gao\",\"Lijun Wu\",\"Xiaoqing Zheng\",\"Ka-Chun Wong\",\"Lei Bai\",\"Ya-Qin Zhang\",\"Wei-Ying Ma\",\"Dahua Lin\",\"Bowen Zhou\",\"Hao Zhou\"]","published":"2026-05-29T07:58:08Z","proceeding":"q-bio.BM","tasks":"[\"q-bio.BM\",\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false}
