{"ID":2885310,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.05508","arxiv_id":"2508.05508","title":"Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation","abstract":"The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where one agent evaluates another's task completion, are typically designed for narrow, domain-specific settings. To address this gap, we propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. The framework emulates human-like evaluation by decomposing tasks into sub-tasks and validating each step using available information, such as the agent's output and reasoning. Each module contributes to a specific aspect of the evaluation process, and their outputs are aggregated to produce a final verdict on task completion. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively, compared to the GPT-4o based LLM-as-a-Judge baseline. This demonstrates the potential of our proposed general-purpose evaluation framework.","short_abstract":"The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where...","url_abs":"https://arxiv.org/abs/2508.05508","url_pdf":"https://arxiv.org/pdf/2508.05508v1","authors":"[\"Roshita Bhonsle\",\"Rishav Dutta\",\"Sneha Vavilapalli\",\"Harsh Seth\",\"Abubakarr Jaye\",\"Yapei Chang\",\"Mukund Rungta\",\"Emmanuel Aboah Boateng\",\"Sadid Hasan\",\"Ehi Nosakhare\",\"Soundar Srinivasan\"]","published":"2025-08-07T15:39:48Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
