{"ID":2866707,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.20319","arxiv_id":"2509.20319","title":"Z-Scores: A Metric for Linguistically Assessing Disfluency Removal","abstract":"Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.","short_abstract":"Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric tha...","url_abs":"https://arxiv.org/abs/2509.20319","url_pdf":"https://arxiv.org/pdf/2509.20319v1","authors":"[\"Maria Teleki\",\"Sai Janjur\",\"Haoran Liu\",\"Oliver Grabner\",\"Ketan Verma\",\"Thomas Docog\",\"Xiangjue Dong\",\"Lingfeng Shi\",\"Cong Wang\",\"Stephanie Birkelbach\",\"Jason Kim\",\"Yin Zhang\",\"James Caverlee\"]","published":"2025-09-24T17:02:39Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"eess.AS\"]","methods":"[\"Large Language Model\"]","has_code":false}