{"ID":2847938,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.26264","arxiv_id":"2510.26264","title":"Space-Efficient k-Mismatch Text Indexes","abstract":"A central task in string processing is text indexing, where the goal is to preprocess a text (a string of length $n$) into an efficient index (a data structure) supporting queries about the text. Cole, Gottlieb, and Lewenstein (STOC 2004) proposed $k$-errata trees, a family of text indexes supporting approximate pattern matching queries of several types. In particular, $k$-errata trees yield an elegant solution to $k$-mismatch queries, where we are to report all substrings of the text with Hamming distance at most $k$ to the query pattern. The resulting $k$-mismatch index uses $O(n\\log^k n)$ space and answers a query for a length-$m$ pattern in $O(\\log^k n \\log \\log n + m + occ)$ time, where $occ$ is the number of approximate occurrences. In retrospect, $k$-errata trees appear very well optimized: even though a large body of work has adapted $k$-errata trees to various settings throughout the past two decades, the original time-space trade-off for $k$-mismatch indexing has not been improved in the general case. We present the first such improvement, a $k$-mismatch index with $O(n\\log^{k-1} n)$ space and the same query time as $k$-errata trees. Previously, due to a result of Chan, Lam, Sung, Tam, and Wong (Algorithmica 2010), such an $O(n\\log^{k-1} n)$-size index has been known only for texts over alphabets of constant size. In this setting, however, we obtain an even smaller $k$-mismatch index of size only $O(n \\log^{k-2+\\varepsilon+\\frac{2}{k+2-(k \\bmod 2)}} n)\\subseteq O(n\\log^{k-1.5+\\varepsilon} n)$ for $2\\le k\\le O(1)$ and any constant $\\varepsilon\u003e0$. Along the way, we also develop improved indexes for short patterns, offering better trade-offs in this practically relevant special case.","short_abstract":"A central task in string processing is text indexing, where the goal is to preprocess a text (a string of length $n$) into an efficient index (a data structure) supporting queries about the text. Cole, Gottlieb, and Lewenstein (STOC 2004) proposed $k$-errata trees, a family of text indexes supporting approximate patter...","url_abs":"https://arxiv.org/abs/2510.26264","url_pdf":"https://arxiv.org/pdf/2510.26264v1","authors":"[\"Tomasz Kociumaka\",\"Jakub Radoszewski\"]","published":"2025-10-30T08:45:00Z","proceeding":"cs.DS","tasks":"[\"cs.DS\"]","methods":"[\"Generative Adversarial Network\"]","has_code":false}
