{"ID":2871211,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.19329","arxiv_id":"2509.19329","title":"How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment","abstract":"We examined how model size, temperature, and prompt style affect the alignment of Large Language Models (LLMs) within a model, between models, and with humans in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The study highlights the importance of checking alignment across multiple levels.","short_abstract":"We examined how model size, temperature, and prompt style affect the alignment of Large Language Models (LLMs) within a model, between models, and with humans in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The study highlights the importance of checking alignment across mu...","url_abs":"https://arxiv.org/abs/2509.19329","url_pdf":"https://arxiv.org/pdf/2509.19329v1","authors":"[\"Julie Jung\",\"Max Lu\",\"Sina Chole Benker\",\"Dogus Darici\"]","published":"2025-09-14T02:24:41Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"stat.ME\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}