{"ID":2849047,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.24295","arxiv_id":"2510.24295","title":"MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference","abstract":"In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test as MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models' perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how word class of the replacements, word probability, and plausibility influence NLI models' performance.","short_abstract":"In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we prop...","url_abs":"https://arxiv.org/abs/2510.24295","url_pdf":"https://arxiv.org/pdf/2510.24295v1","authors":"[\"Mădălina Zgreabăn\",\"Tejaswini Deoskar\",\"Lasha Abzianidze\"]","published":"2025-10-28T10:58:59Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}
