{"ID":2841179,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12004","arxiv_id":"2511.12004","title":"ComLQ: Benchmarking Complex Logical Queries in Information Retrieval","abstract":"Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking \\emph{complex logical queries} involving first-order logic operations such as conjunction ($\\land$), disjunction ($\\lor$), and negation ($\\lnot$). Thus, these benchmarks cannot adequately evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method leveraging large language models (LLMs) to construct a new IR dataset \\textbf{ComLQ} for \\textbf{Com}plex \\textbf{L}ogical \\textbf{Q}ueries, which comprises 2,909 queries and 11,251 candidate passages. A key challenge in constructing the dataset lies in capturing the underlying logical structures within unstructured text. We therefore design a subgraph-guided prompt with a subgraph indicator, which guides an LLM (such as GPT-4o) to generate queries with specific logical structures based on selected passages. Expert annotation ensures \\emph{structure conformity} and \\emph{evidence distribution} for all query-passage pairs in ComLQ. To better evaluate whether retrievers can handle queries with negation, we further propose a new evaluation metric, \\textbf{Log-Scaled Negation Consistency} (\\textbf{LSNC@$K$}). As a supplement to standard relevance-based metrics (such as nDCG and mAP), LSNC@$K$ measures whether top-$K$ retrieved passages violate negation conditions in queries. 
Our experimental results under zero-shot settings show that existing retrieval models have limited performance on complex logical queries, especially those with negation, exposing their weak ability to model exclusion.","short_abstract":"Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking \\emph{complex logical queries} involving first-order logic oper...","url_abs":"https://arxiv.org/abs/2511.12004","url_pdf":"https://arxiv.org/pdf/2511.12004v2","authors":"[\"Ganlin Xu\",\"Zhitao Yin\",\"Linghao Zhang\",\"Jiaqing Liang\",\"Weijia Lu\",\"Xiaodong Zhang\",\"Zhifei Yang\",\"Sihang Jiang\",\"Deqing Yang\"]","published":"2025-11-15T02:58:21Z","proceeding":"cs.IR","tasks":"[\"cs.IR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}