{"ID":2872828,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.07309","arxiv_id":"2509.07309","title":"PIE: Performance Interval Estimation for Free-Form Generation Tasks","abstract":"Confidence estimation infers a probability for whether each model output is correct or not. While predicting such binary correctness is sensible for tasks with exact answers, free-form generation tasks are often more nuanced, with output quality being both fine-grained and multi-faceted. We thus propose Performance Interval Estimation (PIE) to predict both: 1) point estimates for any arbitrary set of continuous-valued evaluation metrics; and 2) calibrated uncertainty intervals around these point estimates. We then compare two approaches: LLM-as-judge vs. classic regression with confidence estimation features. Evaluation over 11 datasets spans summarization, translation, code generation, function-calling, and question answering. Regression is seen to achieve both: i) lower error point estimates of metric scores; and ii) well-calibrated uncertainty intervals. To support reproduction and follow-on work, we share our data and code.","short_abstract":"Confidence estimation infers a probability for whether each model output is correct or not. While predicting such binary correctness is sensible for tasks with exact answers, free-form generation tasks are often more nuanced, with output quality being both fine-grained and multi-faceted. We thus propose Performance Int...","url_abs":"https://arxiv.org/abs/2509.07309","url_pdf":"https://arxiv.org/pdf/2509.07309v2","authors":"[\"Chi-Yang Hsu\",\"Alexander Braylan\",\"Yiheng Su\",\"Matthew Lease\",\"Omar Alonso\"]","published":"2025-09-09T00:59:34Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}