The Chonkers Algorithm: Content-Defined Chunking with Provable Strict Guarantees on Size and Locality
Abstract
This paper presents the Chonkers algorithm, a novel content-defined chunking method providing simultaneous provable strict guarantees on chunk size and edit locality. Unlike existing algorithms such as Rabin fingerprinting and anchor-based methods, Chonkers achieves bounded propagation of edits and precise control over chunk sizes. I describe the algorithm's layered structure that allows for combination with other chunking algorithms, the theoretical guarantees it provides, implementation considerations, and introduce the Yarn datatype, a deduplicated, merge-tree-based string representation benefiting from Chonkers' strict guarantees. Finally, I experimentally compare Chonkers' ability to deduplicate versioned data to other algorithms and evaluate Chonkers on three corpora with respect to the actually occurring chunk sizes and edit locality, and find that it performs much better in practice than the proved guarantees.