Sequential Spectral Clustering of Data Sequences
Abstract
We study the problem of non-parametric clustering of data sequences, where each data sequence comprises independent and identically distributed (i.i.d.) samples generated from an unknown distribution. The true clusters are defined as those obtained by applying the spectral clustering algorithm (SPEC) to the pairwise distances between the true distributions corresponding to the data sequences. Since the true distributions are unknown, the objective is to estimate the clusters using the minimum number of samples from the data sequences, subject to a specified error probability. To solve this problem, we propose the Sequential Spectral clustering algorithm (SEQ-SPEC) and show that it stops in finite time almost surely and is exponentially consistent. We also propose a computationally more efficient algorithm, the Incremental Approximate Sequential Spectral clustering algorithm (IA-SEQ-SPEC). Through simulations, we show that both SEQ-SPEC and IA-SEQ-SPEC perform better than the fixed sample size SPEC, the Sequential $K$-Medoids clustering algorithm (SEQ-KMED), and the Sequential Single Linkage clustering algorithm (SEQ-SLINK). In addition, we propose memory-efficient versions, SEQ-SPEC-B and IA-SEQ-SPEC-B. Unlike other related sequential clustering algorithms, which require storing all past samples, these versions store only the most recent $B$ samples. Both the computationally efficient and memory-efficient versions of SEQ-SPEC perform comparably to SEQ-SPEC in simulations.
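As a rough illustration of the fixed sample size baseline described above (spectral clustering applied to estimated pairwise distances between the distributions behind each sequence), the following minimal sketch may help. It rests on several assumptions not stated in the abstract: the distance is taken to be the Kolmogorov-Smirnov distance between empirical CDFs, the affinity is a Gaussian kernel with a hand-picked width `sigma`, there are exactly two clusters, and the split is read off the sign of the second eigenvector of the normalized affinity matrix. The function names `ks_distance` and `spectral_bipartition` are hypothetical, and the sequential stopping rule of SEQ-SPEC is not modelled here.

```python
import math
import random


def ks_distance(xs, ys):
    """Kolmogorov-Smirnov distance between two empirical CDFs (assumed metric)."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    best = 0.0
    while i < len(xs) and j < len(ys):
        t = min(xs[i], ys[j])
        # Advance both pointers past every sample <= t, then compare the CDFs at t.
        while i < len(xs) and xs[i] <= t:
            i += 1
        while j < len(ys) and ys[j] <= t:
            j += 1
        best = max(best, abs(i / len(xs) - j / len(ys)))
    return best


def spectral_bipartition(dist, sigma=0.2, iters=500):
    """Split items into two clusters from a pairwise distance matrix.

    Uses a Gaussian affinity and the sign of the second eigenvector of
    D^{-1/2} W D^{-1/2}, found by power iteration with deflation.
    """
    n = len(dist)
    w = [[math.exp(-dist[a][b] ** 2 / (2 * sigma ** 2)) for b in range(n)]
         for a in range(n)]
    deg = [sum(row) for row in w]
    # Shifted matrix (I + D^{-1/2} W D^{-1/2}) / 2 has eigenvalues in [0, 1],
    # so power iteration converges to the largest remaining eigenvalue.
    m = [[0.5 * ((a == b) + w[a][b] / math.sqrt(deg[a] * deg[b]))
          for b in range(n)] for a in range(n)]
    # The top eigenvector is known in closed form: sqrt(deg), normalized.
    top = [math.sqrt(d) for d in deg]
    norm = math.sqrt(sum(t * t for t in top))
    top = [t / norm for t in top]
    rng = random.Random(0)
    v = [rng.uniform(-1, 1) for _ in range(n)]
    for _ in range(iters):
        # Remove the component along the top eigenvector, then multiply by m.
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        v = [sum(m[a][b] * v[b] for b in range(n)) for a in range(n)]
        nv = math.sqrt(sum(a * a for a in v))
        v = [a / nv for a in v]
    return [int(x > 0) for x in v]


# Toy data (an assumption for illustration): three sequences drawn from
# N(0, 1) followed by three sequences drawn from N(3, 1), 200 samples each.
data_rng = random.Random(1)
sequences = [[data_rng.gauss(mu, 1.0) for _ in range(200)]
             for mu in (0.0, 0.0, 0.0, 3.0, 3.0, 3.0)]
dist = [[ks_distance(a, b) for b in sequences] for a in sequences]
labels = spectral_bipartition(dist)
print(labels)
```

In this toy setup the first three sequences should receive one label and the last three the other; which group is labelled 0 or 1 is arbitrary. A sequential variant would repeat the distance estimation as new samples arrive and stop once some confidence criterion is met, but that criterion is not given in the abstract and is left out here.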