{"ID":2832145,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.06227","arxiv_id":"2512.06227","title":"Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety","abstract":"Real-world indicators play an important role in many natural language processing (NLP) applications, such as life-event for mental health analysis and risky behaviour for online safety, yet labelling such information in training datasets is often costly and/or difficult due to their dynamic nature. Large language models (LLMs) show promising potential for automated annotation, yet multi-label prediction remains challenging. In this work, we propose a Confidence-Aware Fine-Grained Debate (CFD) framework that simulates collaborative annotation using fine-grained information to better support automated multi-label enrichment. We introduce two new expert-annotated resources: a mental health Reddit well-being dataset and an online safety Facebook sharenting risk dataset. Experiments show that CFD achieves the most robust enrichment performance compared to a range of baseline approaches. We further evaluate various training-free enrichment incorporation strategies and demonstrate that LLM-enriched indicators consistently improve our downstream tasks. Enriched features incorporated via debate transcripts yield the largest gains, outperforming the non-enriched baseline by 9.9% on the online safety task.","short_abstract":"Real-world indicators play an important role in many natural language processing (NLP) applications, such as life-event for mental health analysis and risky behaviour for online safety, yet labelling such information in training datasets is often costly and/or difficult due to their dynamic nature. 
Large language model...","url_abs":"https://arxiv.org/abs/2512.06227","url_pdf":"https://arxiv.org/pdf/2512.06227v2","authors":"[\"Junyu Mao\",\"Anthony Hills\",\"Talia Tseriotou\",\"Maria Liakata\",\"Aya Shamir\",\"Dan Sayda\",\"Dana Atzil-Slonim\",\"Natalie Djohari\",\"Arpan Mandal\",\"Silke Roth\",\"Pamela Ugwudike\",\"Mahesan Niranjan\",\"Stuart E. Middleton\"]","published":"2025-12-06T00:21:29Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
