{"ID":2860100,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.05365","arxiv_id":"2510.05365","title":"Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework","abstract":"Large Language Models (LLMs) are increasingly applied to automated software testing, yet their ability to generalize beyond memorized patterns and reason about natural language bug reports remains unclear. We present a systematic evaluation of LLM reasoning in test case generation, structured around the cognitive layers of Bloom's taxonomy: \\textit{Remember}, \\textit{Understand}, \\textit{Apply}, \\textit{Analyze}, \\textit{Evaluate}, and \\textit{Create}, which progressively assess higher levels of cognitive and reasoning capabilities. Building on the LIBRO framework, we evaluate StarCoder and GPT-4o on Defects4J, GHRB, and mutated variants that introduce linguistic and semantic challenges. Our findings show that both models largely reproduce prior results with minor deviations (\\textit{Remember}), exhibit partial robustness to linguistic rephrasings and translations while uncovering unique reproducible bugs (\\textit{Understand}), but suffer severe performance drops exceeding 60\\% under identifier mutations (\\textit{Apply}). Conversely, providing near-identical few-shot examples in an open-book setting improves success rates by up to three times, and component-level analysis reveals that structured technical elements, such as test code and method names, are far more impactful than narrative descriptions for successful test generation (\\textit{Analyze}). These insights illuminate the cognitive processes underlying LLM-generated tests, suggest concrete directions for improving performance, and establish a robust and realistic evaluation paradigm for this task.","short_abstract":"Large Language Models (LLMs) are increasingly applied to automated software testing, yet their ability to generalize beyond memorized patterns and reason about natural language bug reports remains unclear. We present a systematic evaluation of LLM reasoning in test case generation, structured around the cognitive layer...","url_abs":"https://arxiv.org/abs/2510.05365","url_pdf":"https://arxiv.org/pdf/2510.05365v1","authors":"[\"Irtaza Sajid Qureshi\",\"Zhen Ming\",\"Jiang\"]","published":"2025-10-06T20:47:12Z","proceeding":"cs.SE","tasks":"[\"cs.SE\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}