{"ID":3052363,"CreatedAt":"2026-06-04T04:41:36.695875263Z","UpdatedAt":"2026-06-06T07:36:27.491215713Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04552","arxiv_id":"2606.04552","title":"LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling","abstract":"Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to induce adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, LDARNet achieves 11/18 wins among compact models ($\u003c$300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20$\\times$ larger. A FLOPs-matched controlled experiment isolates learned routing as the source of these gains: learned boundaries beat fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute. Nucleotide-resolution analysis further shows that the learned boundaries align with canonical promoter motifs and splice junctions without supervision, providing a biological interpretation for adaptive tokenization in genomic foundation models.","short_abstract":"Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hiera...","url_abs":"https://arxiv.org/abs/2606.04552","url_pdf":"https://arxiv.org/pdf/2606.04552v1","authors":"[\"Daria Ledneva\",\"Denis Kuznetsov\"]","published":"2026-06-03T07:38:17Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"q-bio.GN\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
