{"ID":2870142,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.12610","arxiv_id":"2509.12610","title":"ScaleDoc: Scaling LLM-based Predicates over Large Document Collections","abstract":"Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demands semantic understanding, beyond traditional value-based predicates. Given enormous documents and ad-hoc queries, while Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, their high inference cost leads to unacceptable overhead. Therefore, we introduce \\textsc{ScaleDoc}, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, \\textsc{ScaleDoc} leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for final decision. Furthermore, \\textsc{ScaleDoc} proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; (2) an adaptive cascade mechanism that determines the effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that \\textsc{ScaleDoc} achieves over a 2$\\times$ end-to-end speedup and reduces expensive LLM invocations by up to 85\\%, making large-scale semantic analysis practical and efficient.","short_abstract":"Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demands semantic understanding, beyond traditional value-based predicates. 
Given enormous documents and ad-hoc queries, while Large Language Models (LLMs) demonstrate powerful ze...","url_abs":"https://arxiv.org/abs/2509.12610","url_pdf":"https://arxiv.org/pdf/2509.12610v2","authors":"[\"Hengrui Zhang\",\"Yulong Hui\",\"Yihao Liu\",\"Huanchen Zhang\"]","published":"2025-09-16T03:18:06Z","proceeding":"cs.DB","tasks":"[\"cs.DB\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}