The IRMA Dataset: A Structured Audio-MIDI Corpus for Iranian Classical Music

Aug 27, 2025 cs.SD arXiv:2508.19876

Abstract

We present the IRMA Dataset (Iranian Radif MIDI Audio), a multi-level, open-access corpus designed for the computational study of Iranian classical music, with a particular emphasis on the radif, a structured repertoire of modal-melodic units central to pedagogy and performance. The dataset combines symbolic MIDI representations, phrase-level audio-MIDI alignment, musicological transcriptions in PDF format, and comparative tables of theoretical information curated from a range of performers and scholars. We outline the multi-phase construction process, including segment annotation, alignment methods, and a structured system of identifier codes to reference individual musical units. The current release includes the complete radif of Karimi; MIDI files and metadata from Mirza Abdollah's radif; selected segments from the vocal radif of Davami, as transcribed by Payvar and Fereyduni; and a dedicated section featuring audio-MIDI examples of tahrir ornamentation performed by prominent 20th-century vocalists. While the symbolic and analytical components are released under an open-access license (CC BY-NC 4.0), some referenced audio recordings and third-party transcriptions are cited using discographic information to enable users to locate the original materials independently, pending copyright permission. Serving both as a scholarly archive and a resource for computational analysis, this dataset supports applications in ethnomusicology, pedagogy, symbolic audio research, cultural heritage preservation, and AI-driven tasks such as automatic transcription and music generation. We welcome collaboration and feedback to support its ongoing refinement and broader integration into musicological and machine learning workflows.

Abstract

PDF Viewer