{"ID":2921888,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T22:46:55.310989306Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01490","arxiv_id":"2606.01490","title":"LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies","abstract":"We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\\times2\\times2$ factorial design (Authority $\\times$ Roles $\\times$ Dynamics), we conducted 520 experimental runs across 8 design tasks of varying complexity, with 5 repetitions each. Designs were evaluated on a 12-dimensional rubric by three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6). We report four core findings. First, structural adversarial (v4b) ranks #1 by ensemble -- a prompt-engineered adversarial variant that demands rewrite mandates rather than patches (weighted ensemble: 4.637/5.0). Second, cross-model review wins unanimously at #2 -- generate with one model, review with another -- ranking #2 by all three evaluators (weighted ensemble: 4.606). Third, evaluator diversity is itself a finding -- all three evaluators agree v4b is best and v3 is worst, but disagree sharply on v2b (Claude d=1.44 vs. GPT-OSS d=0.45), revealing how different model families weight design qualities. Fourth, parallel merge is fundamentally broken -- all three evaluators place merge variants in the bottom tier (3.65-3.79), due to token starvation and the Frankenstein effect. The weighted ensemble ($2\\times$Opus + $2\\times$Sonnet + $1\\times$GPT-OSS) provides robust rankings across 520 runs, confirmed through independent cross-validation.","short_abstract":"We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\\times2\\times2$ factorial design (Authority $\\times$ Roles $\\times$ Dynamics), we conducted 520 experimental runs across 8 design tasks of varying complexity, with 5 repetitions each. De...","url_abs":"https://arxiv.org/abs/2606.01490","url_pdf":"https://arxiv.org/pdf/2606.01490v1","authors":"[\"Nagarjuna Kanamarlapudi\",\"Praveen K\"]","published":"2026-05-31T23:15:40Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\",\"cs.MA\"]","methods":"[\"Large Language Model\"]","has_code":false}