Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths

cs.SE arXiv:2510.01379

Abstract

Large Language Models (LLMs) have become central to automated code generation, yet existing approaches operate within a single-LLM paradigm: one model is selected and applied throughout the entire generation process. We observe that different LLMs exhibit complementary strengths: no single model dominates across all programming languages, algorithmic problem categories, or development stages. Multi-LLM collaboration, structured as per-stage, per-category routing rather than majority voting, produces higher-quality code than any individual model. Based on this observation, we propose PerfOrch, a multi-agent orchestration system that decomposes code generation into four collaborative agents: categorization, generation, debugging, and refinement. Each agent maintains a Memory module: a ranking matrix indexed by programming language and problem category, constructed from offline profiling and consulted at runtime to select the most suitable model for each task. We evaluate PerfOrch on two benchmarks, HumanEval-X and EffiBench-X, totaling 2,500 problems across five languages (Python, Java, C++, Go, and Rust). PerfOrch achieves average pass@1 rates of 97.19% on HumanEval-X and 95.83% on EffiBench-X, improving over the strongest single-model pipeline by 1.22-14.58 percentage points across languages. Notably, Memory rankings constructed solely from HumanEval-X profiling generalize to the entirely unseen EffiBench-X benchmark without re-profiling, demonstrating that the complementary-strength patterns PerfOrch exploits are properties of the models rather than artifacts of a specific problem distribution. Beyond correctness, PerfOrch improves execution time for 61-90% of solved problems with mean speedups of 4.7-29.9%, matching the refinement coverage of exhaustive multi-model evaluation at roughly half the token cost.
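As a rough illustration of the Memory module described above, the sketch below builds a ranking matrix indexed by programming language and problem category from offline profiling scores, then consults it at runtime to pick a model. It is only a minimal sketch under stated assumptions: the model names (`model_a`, `model_b`), the category label, the scores, the function names, and the `attempt` fallback parameter are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

# (language, category) -> models ordered from best to worst.
Ranking = Dict[Tuple[str, str], List[str]]


def build_ranking(profile: Dict[Tuple[str, str, str], float]) -> Ranking:
    """Turn offline profiling scores into per-(language, category) rankings.

    `profile` maps (language, category, model) -> score, e.g. a pass rate.
    The score definition is an assumption made for this sketch.
    """
    grouped: Dict[Tuple[str, str], List[Tuple[float, str]]] = {}
    for (language, category, model), score in profile.items():
        grouped.setdefault((language, category), []).append((score, model))
    # Sort by descending score; break ties by model name for determinism.
    return {
        key: [m for _, m in sorted(entries, key=lambda e: (-e[0], e[1]))]
        for key, entries in grouped.items()
    }


def select_model(ranking: Ranking, language: str, category: str,
                 attempt: int = 0) -> Optional[str]:
    """Return the model at rank `attempt`, or None if none is available.

    Falling back to lower-ranked models via `attempt` is an assumption of
    this sketch, not a mechanism stated in the abstract.
    """
    models = ranking.get((language, category), [])
    return models[attempt] if attempt < len(models) else None


# Hypothetical profiling data for one agent (e.g. the generation agent).
profile = {
    ("Python", "dynamic_programming", "model_a"): 0.92,
    ("Python", "dynamic_programming", "model_b"): 0.95,
    ("Rust", "dynamic_programming", "model_a"): 0.88,
    ("Rust", "dynamic_programming", "model_b"): 0.81,
}
generation_memory = build_ranking(profile)
```

In this toy data the preferred model differs between Python and Rust for the same category, which is the kind of complementary strength that per-language, per-category routing is meant to exploit. Under the abstract's description, each of the four agents would hold its own such matrix.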
