{"ID":2845231,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.04148","arxiv_id":"2511.04148","title":"EntroGD: Scalable Generalized Deduplication for Efficient Direct Analytics on Compressed IoT Data","abstract":"Massive data streams from IoT and cyber-physical systems must be processed under strict bandwidth, latency, and resource constraints. Generalized Deduplication (GD) is a promising lossless compression framework, as it supports random access and direct analytics on compressed data. However, existing GD algorithms exhibit quadratic complexity $\\mathcal{O}(nd^{2})$, which limits their scalability for high-dimensional datasets. This paper proposes \\textbf{EntroGD}, an entropy-guided GD framework that decouples analytical fidelity from compression efficiency to achieve linear complexity $\\mathcal{O}(nd)$. EntroGD adopts a two-stage design, first constructing compact condensed samples to preserve information critical for analytics, and then applying entropy-based bit selection to maximize compression. Experiments on 18 IoT datasets show that EntroGD reduces configuration time by up to $53.5\\times$ compared to state-of-the-art GD compressors. Moreover, by enabling analytics with access to only $2.6\\%$ of the original data volume, EntroGD accelerates clustering by up to $31.6\\times$ with negligible loss in accuracy. Overall, EntroGD provides a scalable and system-efficient solution for direct analytics on compressed IoT data.","short_abstract":"Massive data streams from IoT and cyber-physical systems must be processed under strict bandwidth, latency, and resource constraints. Generalized Deduplication (GD) is a promising lossless compression framework, as it supports random access and direct analytics on compressed data. However, existing GD algorithms exhibi...","url_abs":"https://arxiv.org/abs/2511.04148","url_pdf":"https://arxiv.org/pdf/2511.04148v2","authors":"[\"Xiaobo Zhao\",\"Daniel E. Lucani\"]","published":"2025-11-06T07:54:46Z","proceeding":"cs.DB","tasks":"[\"cs.DB\"]","methods":"[]","has_code":false}
