{"ID":2851971,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.18904","arxiv_id":"2510.18904","title":"DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code","abstract":"The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose the fine-tuning of encoder-only Small Language Models (SLMs), in particular, the pre-trained models of RoBERTA and CodeBERTa using specialized datasets on source code and other natural language to prove that for the task of binary classification, SLMs outperform LLMs by a huge margin whilst using a fraction of compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 $0.89$ to $0.94$ while reducing latency by $8$-$12\\times$ and peak VRAM by $3$-$5\\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains $\\geq 92%$ of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.","short_abstract":"The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, such as Fast DetectGPT or GPTZero, either i...","url_abs":"https://arxiv.org/abs/2510.18904","url_pdf":"https://arxiv.org/pdf/2510.18904v1","authors":"[\"Shriyansh Agrawal\",\"Aidan Lau\",\"Sanyam Shah\",\"Ahan M R\",\"Kevin Zhu\",\"Sunishchal Dev\",\"Vasu Sharma\"]","published":"2025-10-21T00:17:00Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.IR\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
