{"ID":2833190,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.05247","arxiv_id":"2512.05247","title":"Incorporating indel channels into average-case analysis of seed-chain-extend","abstract":"Given a sequence $s_1$ of $n$ letters drawn i.i.d. from an alphabet of size $σ$ and a mutated substring $s_2$ of length $m \u003c n$, we often want to recover the mutation history that generated $s_2$ from $s_1$. Modern sequence aligners are widely used for this task, and many employ the seed-chain-extend heuristic with $k$-mer seeds. Previously, Shaw and Yu showed that optimal linear-gap cost chaining can produce a chain with $1 - O\\left(\\frac{1}{\\sqrt{m}}\\right)$ recoverability, the proportion of the mutation history that is recovered, in $O\\left(mn^{2.43θ} \\log n\\right)$ expected time, where $θ\u003c 0.206$ is the mutation rate under a substitution-only channel and $s_1$ is assumed to be uniformly random. However, a gap remains between theory and practice, since real genomic data includes insertions and deletions (indels), and yet seed-chain-extend remains effective. In this paper, we generalize those prior results by introducing mathematical machinery to deal with the two new obstacles introduced by indel channels: the dependence of neighboring anchors and the presence of anchors that are only partially correct. We are thus able to prove that the expected recoverability of an optimal chain is $\\ge 1 - O\\Bigl(\\frac{1}{\\sqrt{m}}\\Bigr)$ and the expected runtime is $O(mn^{3.15 \\cdot θ_T}\\log n)$, when the total mutation rate given by the sum of the substitution, insertion, and deletion mutation rates ($θ_T = θ_i + θ_d + θ_s$) is less than $0.159$.","short_abstract":"Given a sequence $s_1$ of $n$ letters drawn i.i.d. from an alphabet of size $σ$ and a mutated substring $s_2$ of length $m \u003c n$, we often want to recover the mutation history that generated $s_2$ from $s_1$. Modern sequence aligners are widely used for this task, and many employ the seed-chain-extend heuristic with $k$...","url_abs":"https://arxiv.org/abs/2512.05247","url_pdf":"https://arxiv.org/pdf/2512.05247v1","authors":"[\"Spencer Gibson\",\"Yun William Yu\"]","published":"2025-12-04T20:51:20Z","proceeding":"cs.DS","tasks":"[\"cs.DS\",\"q-bio.QM\"]","methods":"[]","has_code":false}
