{"ID":2921862,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T20:38:10.546707057Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01451","arxiv_id":"2606.01451","title":"Before and After Temperature: A Distributional View of Creative LLM Generation","abstract":"Reference-free evaluation of large language model (LLM) creativity relies on perplexity, entropy, and top-1 margin. We show that a much stronger signal lives one step earlier in the pipeline: in how sampling temperature \\emph{reshapes} the model's token distribution before the next token is drawn. On Llama-3.1-8B-Instruct generations of 500 open-ended creative prompts at $T \\in \\{0.3, 0.8, 1.5\\}$, a single per-token feature derived from this reshaping predicts the within-prompt creativity rank at Spearman $ρ{=}0.918$ against an averaged gpt-4o\\,/\\,gemini-2.5-pro judge ($n{=}500$) and $ρ{=}0.870$ against a three-rater human-majority ranking ($n{=}150$). Each of four standard reference-free baselines (self-perplexity, mean predictive entropy, top-1 margin, gzip compression ratio) tops out at $|ρ|\\!\\approx\\!0.76$ on both ground truths: a gap of $+0.165$ on averaged-LLM and $+0.110$ on human-majority, both far larger than the spread among the baselines themselves. The two ground-truth panels agree with each other at $ρ{=}0.83$, above the inter-human ceiling of $ρ{=}0.77$, so the comparison is not bottlenecked by judge noise. Mechanistically, the win comes from a sharp distributional signature of the incoherence regime: at $T{=}1.5$ the cumulative-mass width $n_{95}(q)$ inflates from $\\sim\\!1$ to ${\\sim}\\!131$ tokens and post-temperature mass leaks off the pre-temperature top-$90\\%$ plausible set by about $13$ percentage points. The per-token aggregates do not separate $T{=}0.8$ from $T{=}0.3$; discriminating the two coherent regimes is left to sequence-level features.","short_abstract":"Reference-free evaluation of large language model (LLM) creativity relies on perplexity, entropy, and top-1 margin. We show that a much stronger signal lives one step earlier in the pipeline: in how sampling temperature \\emph{reshapes} the model's token distribution before the next token is drawn. On Llama-3.1-8B-Instr...","url_abs":"https://arxiv.org/abs/2606.01451","url_pdf":"https://arxiv.org/pdf/2606.01451v1","authors":"[\"V. S. Raghu Parupudi\",\"Harsha Ponnada\",\"Aditi Kaushal\",\"S. Shria Parupudi\",\"Saiteja Dasari\",\"Sahiti Bulusu\"]","published":"2026-05-31T21:13:47Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
