{"ID":2842146,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.10070","arxiv_id":"2511.10070","title":"ADI-20: Arabic Dialect Identification dataset and models","abstract":"We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.","short_abstract":"We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of...","url_abs":"https://arxiv.org/abs/2511.10070","url_pdf":"https://arxiv.org/pdf/2511.10070v1","authors":"[\"Haroun Elleuch\",\"Salima Mdhaffar\",\"Yannick Estève\",\"Fethi Bougares\"]","published":"2025-11-13T08:17:00Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
