{"ID":2835132,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.00378","arxiv_id":"2512.00378","title":"The Information Theory of Similarity","abstract":"We establish a precise mathematical equivalence between witness-based similarity systems (REWA) and Shannon's information theory. We prove that witness overlap is mutual information, that REWA bit complexity bounds arise from channel capacity limitations, and that ranking-preserving encodings obey rate-distortion constraints. This unification reveals that fifty years of similarity search research -- from Bloom filters to locality-sensitive hashing to neural retrieval -- implicitly developed information theory for relational data. We derive fundamental lower bounds showing that REWA's $O(Δ^{-2} \\log N)$ complexity is optimal: no encoding scheme can preserve similarity rankings with fewer bits. The framework establishes that semantic similarity has physical units (bits of mutual information), search is communication (query transmission over a noisy channel), and retrieval systems face fundamental capacity limits analogous to Shannon's channel coding theorem.","short_abstract":"We establish a precise mathematical equivalence between witness-based similarity systems (REWA) and Shannon's information theory. We prove that witness overlap is mutual information, that REWA bit complexity bounds arise from channel capacity limitations, and that ranking-preserving encodings obey rate-distortion const...","url_abs":"https://arxiv.org/abs/2512.00378","url_pdf":"https://arxiv.org/pdf/2512.00378v1","authors":"[\"Nikit Phadke\"]","published":"2025-11-29T08:12:45Z","proceeding":"cs.IT","tasks":"[\"cs.IT\",\"cs.DS\",\"cs.IR\",\"cs.LG\"]","methods":"[]","has_code":false}
