{"ID":3049968,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-06T15:12:00.6907593Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05003","arxiv_id":"2606.05003","title":"PhysDox: Benchmarking LLMs on Physical Feasibility Auditing of Physiological Sensing Protocols","abstract":"Large language models (LLMs) increasingly assist in experimental design, yet fluent protocols often remain physically infeasible. We introduce PhysDox, a physical feasibility auditing benchmark for biomedical protocols comprising a 683-sample expert-curated Gold set and a 5,000-sample Silver set across six sensing domains. We formulate the task as a two-stage evaluation: severity detection classifying protocols as valid, minor, or fatal, followed by the constraint-level diagnosis of fatal violations. Evaluating 6 LLMs across 4 inference strategies yields a peak Stage-1 macro-F1 of only 53.0. Moreover, strong oracle diagnosis collapses during end-to-end evaluation due to correlated cascade errors. Error analysis reveals scaffold bias, where models conflate procedural completeness with physical validity. Consequently, implicit constraints exhibit a 2 times higher miss rate than explicit hardware violations, supported by strong statistical correlation at $ρ{=}0.81$ and $p{\u003c}0.01$. Trace analysis of false negatives exposes a 54%--46% split between attention and judgment failures, ultimately demonstrating that protocol auditing demands calibrated feasibility reasoning rather than factual recall or longer rationales.","short_abstract":"Large language models (LLMs) increasingly assist in experimental design, yet fluent protocols often remain physically infeasible. We introduce PhysDox, a physical feasibility auditing benchmark for biomedical protocols comprising a 683-sample expert-curated Gold set and a 5,000-sample Silver set across six sensing doma...","url_abs":"https://arxiv.org/abs/2606.05003","url_pdf":"https://arxiv.org/pdf/2606.05003v1","authors":"[\"He Liu\",\"Boyuan Gu\",\"Shuaiqi Cheng\",\"Haiyang Sun\",\"Siyu You\",\"Xuming Hu\"]","published":"2026-06-03T15:20:15Z","proceeding":"cs.HC","tasks":"[\"cs.HC\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
