{"ID":3083752,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T06:37:52.911886358Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.06117","arxiv_id":"2606.06117","title":"$p$-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences","abstract":"We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines $p$-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a $p$-adic distance on $k$-mer prefixes, which captures hierarchical positional structure, and a compositional $L_1$ distance on $k$-mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi-filtered Vietoris--Rips complex, and per-sequence topological summaries from this bi-filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single $p$-adic axis is topologically uninformative and why the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks ($28$ to $500$ sequences, $3$ to $7$ classes), pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to $21$ percentage points; it underperforms only on a SARS-CoV-2 variant benchmark whose point-mutation divergence violates the hierarchical assumption, and all methods saturate in the large-sample regime. pVR also outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by $6.7$ to $11.4$ percentage points on three low-sample benchmarks. The pVR codebase is publicly available at https://github.com/MAHI-Group/pVR.","short_abstract":"We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines $p$-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a $p$-adic distance on $k$-mer prefixes, which captures hierarchical positional structur...","url_abs":"https://arxiv.org/abs/2606.06117","url_pdf":"https://arxiv.org/pdf/2606.06117v1","authors":"[\"Tirtharaj Dash\",\"Gunja Sachdeva\"]","published":"2026-06-04T13:05:36Z","proceeding":"q-bio.QM","tasks":"[\"q-bio.QM\",\"cs.LG\",\"math.AT\",\"q-bio.GN\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":612827,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-05T06:46:15.197025399Z","DeletedAt":null,"paper_id":3083752,"paper_url":"https://arxiv.org/abs/2606.06117","paper_title":"$p$-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences","repo_url":"https://github.com/MAHI-Group/pVR","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
