{"ID":2830933,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.10147","arxiv_id":"2512.10147","title":"Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences","abstract":"Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.","short_abstract":"Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches...","url_abs":"https://arxiv.org/abs/2512.10147","url_pdf":"https://arxiv.org/pdf/2512.10147v1","authors":"[\"Sarwan Ali\",\"Taslim Murad\"]","published":"2025-12-10T23:03:10Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"q-bio.GN\"]","methods":"[]","has_code":false}
