The Chonkers Algorithm: Content-Defined Chunking with Provable Strict Guarantees on Size and Locality

cs.DS arXiv:2509.11121
View PDF arXiv JSON

Abstract

This paper presents the Chonkers algorithm, a novel content-defined chunking method providing simultaneous provable strict guarantees on chunk size and edit locality. Unlike existing algorithms such as Rabin fingerprinting and anchor-based methods, Chonkers achieves bounded propagation of edits and precise control over chunk sizes. I describe the algorithm's layered structure that allows for combination with other chunking algorithms, the theoretical guarantees it provides, implementation considerations, and introduce the Yarn datatype, a deduplicated, merge-tree-based string representation benefiting from Chonkers' strict guarantees. Finally, I experimentally compare Chonkers' ability to deduplicate versioned data to other algorithms and evaluate Chonkers on three corpora with respect to the actually occurring chunk sizes and edit locality, and find that it performs much better in practice than the proved guarantees.

PDF Viewer