{"ID":2859036,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.07551","arxiv_id":"2510.07551","title":"An Evaluation Study of Hybrid Methods for Multilingual PII Detection","abstract":"The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP's modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.","short_abstract":"The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language m...","url_abs":"https://arxiv.org/abs/2510.07551","url_pdf":"https://arxiv.org/pdf/2510.07551v1","authors":"[\"Harshit Rajgarhia\",\"Suryam Gupta\",\"Asif Shaik\",\"Gulipalli Praveen Kumar\",\"Y Santhoshraj\",\"Sanka Nithya Tanvy Nishitha\",\"Abhishek Mukherji\"]","published":"2025-10-08T21:03:59Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
