{"ID":2864805,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.23415","arxiv_id":"2509.23415","title":"From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents","abstract":"Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that while the best-performing agents achieve Pass@5 of over 90% (at least one of five trials) on IncreQA and 60-70% on AdaptQA, their Pass^5 (consistent success across all five trials) is substantially lower, with gaps of up to about 60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development. Our code and data are publicly available at https://github.com/glee4810/EHR-ChatQA.","short_abstract":"Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user question...","url_abs":"https://arxiv.org/abs/2509.23415","url_pdf":"https://arxiv.org/pdf/2509.23415v2","authors":"[\"Gyubok Lee\",\"Woosog Chay\",\"Heeyoung Kwak\",\"Yeong Hwa Kim\",\"Haanju Yoo\",\"Oksoon Jeong\",\"Meong Hi Son\",\"Edward Choi\"]","published":"2025-09-27T17:13:51Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":609202,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2864805,"paper_url":"https://arxiv.org/abs/2509.23415","paper_title":"From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents","repo_url":"https://github.com/glee4810/EHR-ChatQA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}