{"ID":3084746,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T00:57:50.230973856Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05548","arxiv_id":"2606.05548","title":"ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer","abstract":"The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose \\textbf{LLM-as-a-Developer}, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness. We implement this in \\textbf{ADK Arena}, a fully automated pipeline with per-framework Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, $τ^2$-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent--benchmark pairs), we find that: (1)~generation succeeds for 57\\% of runs, and its cost varies 5.6$\\times$ across frameworks (\\$0.6 to \\$3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2)~no single framework dominates: the best single-benchmark ADK agents resolve up to 80\\% of tasks and can even \\emph{beat} general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32\\%; (3)~across information-source ablations, genuine framework usage stays within a narrow 28--40\\% band (highest with raw source access and still 33\\% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable rather than any one being a hard bottleneck.","short_abstract":"The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose \\textbf{LLM-as-a-Developer}, a methodology that replaces human developers with an LLM coding a...","url_abs":"https://arxiv.org/abs/2606.05548","url_pdf":"https://arxiv.org/pdf/2606.05548v1","authors":"[\"Jintao Huang\",\"Xiaomin Li\",\"Gaurav Mittal\",\"Yu Hu\"]","published":"2026-06-04T01:00:54Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
