{"ID":2894930,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.10391","arxiv_id":"2507.10391","title":"Instance-Optimized String Fingerprints","abstract":"Recent research found that cloud data warehouses are text-heavy. However, their capabilities for efficiently processing string columns remain limited, relying primarily on techniques like dictionary encoding and prefix-based partition pruning. In recent work, we introduced string fingerprints - a lightweight secondary index structure designed to approximate LIKE predicates, albeit with false positives. This approach is particularly compelling for columnar query engines, where fingerprints can help reduce both compute and I/O overhead. We show that string fingerprints can be optimized for specific workloads using mixed-integer optimization, and that they can generalize to unseen table predicates. On an IMDb column evaluated in DuckDB v1.3, this yields table-scan speedups of up to 1.36$\\times$.","short_abstract":"Recent research found that cloud data warehouses are text-heavy. However, their capabilities for efficiently processing string columns remain limited, relying primarily on techniques like dictionary encoding and prefix-based partition pruning. In recent work, we introduced string fingerprints - a lightweight secondary...","url_abs":"https://arxiv.org/abs/2507.10391","url_pdf":"https://arxiv.org/pdf/2507.10391v1","authors":"[\"Mihail Stoian\",\"Johannes Thürauf\",\"Andreas Zimmerer\",\"Alexander van Renen\",\"Andreas Kipf\"]","published":"2025-07-14T15:30:36Z","proceeding":"cs.DB","tasks":"[\"cs.DB\"]","methods":"[]","has_code":false}
