{"ID":2842504,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.11758","arxiv_id":"2511.11758","title":"Protein Structure Tokenization via Geometric Byte Pair Encoding","abstract":"Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete ``sentences'' of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an $\\mathrm{SE}(3)$ end-frame loss. GeoBPE offers compression ($\u003e$10x reduction in bits-per-residue at similar distortion rate), data efficiency ($\u003e$10x less training data), and generalization (maintains test/train distortion ratio of $1.0-1.1$). It is architecture-agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue-level embeddings from large PLMs into motif- and protein-level representations, consistently outperforming leading PSTs across $12$ tasks and $24$ test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and (c) tokens align with CATH functional families and support expert-interpretable case studies, offering functional meaning absent in prior PSTs. Code is available at https://github.com/shiningsunnyday/PT-BPE/.","short_abstract":"Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting in...","url_abs":"https://arxiv.org/abs/2511.11758","url_pdf":"https://arxiv.org/pdf/2511.11758v2","authors":"[\"Michael Sun\",\"Weize Yuan\",\"Gang Liu\",\"Wojciech Matusik\",\"Marinka Zitnik\"]","published":"2025-11-13T22:53:29Z","proceeding":"q-bio.QM","tasks":"[\"q-bio.QM\",\"cs.AI\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607132,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2842504,"paper_url":"https://arxiv.org/abs/2511.11758","paper_title":"Protein Structure Tokenization via Geometric Byte Pair Encoding","repo_url":"https://github.com/shiningsunnyday/PT-BPE","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
