R-enum Revisited: Speedup and Extension for Context-Sensitive Repeats and Net Frequencies
Abstract
Nishimoto and Tabei [CPM, 2021] proposed r-enum, an algorithm to enumerate various characteristic substrings, including maximal repeats, in a string $T$ of length $n$ in $O(r)$ words of compressed working space, where $r \le n$ is the number of runs in the Burrows-Wheeler transform (BWT) of $T$. Given the run-length encoded BWT (RLBWT) of $T$, r-enum runs in $O(n \log \log_{w} (n/r))$ time in addition to the time linear to the number of output strings, where $w = Θ(\log n)$ is the word size. In this paper, we first improve the $O(n \log \log_{w} (n/r))$ term to $O(n)$. We next extend r-enum to compute other context-sensitive repeats such as near-supermaximal repeats (NSMRs) and supermaximal repeats, as well as the context diversity for every maximal repeat in the same complexities. Furthermore, we study net occurrences: An occurrence of a repeat is called a net occurrence if it is not covered by another repeat, and the net frequency of a repeat is the number of its net occurrences. With this terminology, an NSMR is a repeat with a positive net frequency. Given the RLBWT of $T$, we show how to compute the set $S^{nsmr}$ of all NSMRs in $T$ together with their net frequency/occurrences in $O(n)$ time and $O(r)$ space. We also show that an $O(r)$-space data structure can be built from the RLBWT to compute the net frequency/occurrences of any pattern in optimal time. The data structure is built in $O(r)$ space and in $O(n)$ time with high probability or deterministic $O(n + |S^{nsmr}| \log \log \min(σ, |S^{nsmr}|))$ time, where $σ\le r$ is the alphabet size of $T$. To achieve this, we prove that the total number of net occurrences is less than $2r$. With the duality between net occurrences and \emph{minimal unique substrings (MUSs)}, we get a new upper bound $2r$ of the number of MUSs in $T$, which may be of independent interest.