{"ID":3006079,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-04T19:14:31.964469513Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.02866","arxiv_id":"2606.02866","title":"When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning","abstract":"When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p\u003c0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.","short_abstract":"When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic...","url_abs":"https://arxiv.org/abs/2606.02866","url_pdf":"https://arxiv.org/pdf/2606.02866v1","authors":"[\"Chirag Parmar\",\"Akshat Mehta\",\"Henglin Wu\",\"Jagadish Ramamurthy\",\"Shweta Medhekar\"]","published":"2026-06-01T20:29:47Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.MA\"]","methods":"[]","has_code":false}