EntroGD: Scalable Generalized Deduplication for Efficient Direct Analytics on Compressed IoT Data

cs.DB arXiv:2511.04148
View PDF arXiv JSON

Abstract

Massive data streams from IoT and cyber-physical systems must be processed under strict bandwidth, latency, and resource constraints. Generalized Deduplication (GD) is a promising lossless compression framework, as it supports random access and direct analytics on compressed data. However, existing GD algorithms exhibit quadratic complexity $\mathcal{O}(nd^{2})$, which limits their scalability for high-dimensional datasets. This paper proposes \textbf{EntroGD}, an entropy-guided GD framework that decouples analytical fidelity from compression efficiency to achieve linear complexity $\mathcal{O}(nd)$. EntroGD adopts a two-stage design, first constructing compact condensed samples to preserve information critical for analytics, and then applying entropy-based bit selection to maximize compression. Experiments on 18 IoT datasets show that EntroGD reduces configuration time by up to $53.5\times$ compared to state-of-the-art GD compressors. Moreover, by enabling analytics with access to only $2.6\%$ of the original data volume, EntroGD accelerates clustering by up to $31.6\times$ with negligible loss in accuracy. Overall, EntroGD provides a scalable and system-efficient solution for direct analytics on compressed IoT data.

PDF Viewer