{"ID":2846424,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.03047","arxiv_id":"2511.03047","title":"Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions","abstract":"Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.","short_abstract":"Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for...","url_abs":"https://arxiv.org/abs/2511.03047","url_pdf":"https://arxiv.org/pdf/2511.03047v1","authors":"[\"Emi Soroka\",\"Tanmay Chopra\",\"Krish Desai\",\"Sanjay Lall\"]","published":"2025-11-04T22:44:27Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}