The IRMA Dataset: A Structured Audio-MIDI Corpus for Iranian Classical Music
Abstract
We present the IRMA Dataset (Iranian Radif MIDI Audio), a multi-level, open-access corpus designed for the computational study of Iranian classical music, with particular emphasis on the radif, a structured repertoire of modal-melodic units central to pedagogy and performance. The dataset combines symbolic MIDI representations, phrase-level audio-MIDI alignment, musicological transcriptions in PDF format, and comparative tables of theoretical information curated from a range of performers and scholars. We outline the multi-phase construction process, including segment annotation, alignment methods, and a structured system of identifier codes for referencing individual musical units. The current release includes the complete radif of Karimi; MIDI files and metadata from Mirza Abdollah's radif; selected segments from the vocal radif of Davami, as transcribed by Payvar and Fereyduni; and a dedicated section of audio-MIDI examples of tahrir ornamentation performed by prominent 20th-century vocalists. The symbolic and analytical components are released under an open-access license (CC BY-NC 4.0). Pending copyright permission, some referenced audio recordings and third-party transcriptions are instead cited by their discographic information so that users can locate the original materials independently. Serving as both a scholarly archive and a resource for computational analysis, the dataset supports applications in ethnomusicology, pedagogy, symbolic and audio research, cultural heritage preservation, and AI-driven tasks such as automatic transcription and music generation. We welcome collaboration and feedback to support its ongoing refinement and its broader integration into musicological and machine learning workflows.