{"ID":3050009,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-06T14:07:05.414468951Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04915","arxiv_id":"2606.04915","title":"Caliper: Probing Lexical Anchors versus Causal Structure in LLMs","abstract":"Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.","short_abstract":"Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the caus...","url_abs":"https://arxiv.org/abs/2606.04915","url_pdf":"https://arxiv.org/pdf/2606.04915v1","authors":"[\"Zhenyu Yu\",\"Shuigeng Zhou\"]","published":"2026-06-03T14:11:16Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.IR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
