{"ID":2862716,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.26116","arxiv_id":"2509.26116","title":"UncertainGen: Uncertainty-Aware Representations of DNA Sequences for Metagenomic Binning","abstract":"Metagenomic binning aims to cluster DNA fragments from mixed microbial samples into their respective genomes, a critical step for downstream analyses of microbial communities. Existing methods rely on deterministic representations, such as k-mer profiles or embeddings from large language models, which fail to capture the uncertainty inherent in DNA sequences arising from inter-species DNA sharing and from fragments with highly similar representations. We present the first probabilistic embedding approach, UncertainGen, for metagenomic binning, representing each DNA fragment as a probability distribution in latent space. Our approach naturally models sequence-level uncertainty, and we provide theoretical guarantees on embedding distinguishability. This probabilistic embedding framework expands the feasible latent space by introducing a data-adaptive metric, which in turn enables more flexible separation of bins/clusters. Experiments on real metagenomic datasets demonstrate the improvements over deterministic k-mer and LLM-based embeddings for the binning task by offering a scalable and lightweight solution for large-scale metagenomic analysis.","short_abstract":"Metagenomic binning aims to cluster DNA fragments from mixed microbial samples into their respective genomes, a critical step for downstream analyses of microbial communities. Existing methods rely on deterministic representations, such as k-mer profiles or embeddings from large language models, which fail to capture t...","url_abs":"https://arxiv.org/abs/2509.26116","url_pdf":"https://arxiv.org/pdf/2509.26116v1","authors":"[\"Abdulkadir Celikkanat\",\"Andres R. Masegosa\",\"Mads Albertsen\",\"Thomas D. Nielsen\"]","published":"2025-09-30T11:36:09Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CE\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
