{"ID":2868623,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15625","arxiv_id":"2509.15625","title":"The Rhythm In Anything: Audio-Prompted Drums Generation with Masked Language Modeling","abstract":"Musicians and nonmusicians alike use rhythmic sound gestures, such as tapping and beatboxing, to express drum patterns. While these gestures effectively communicate musical ideas, realizing these ideas as fully-produced drum recordings can be time-consuming, potentially disrupting many creative workflows. To bridge this gap, we present TRIA (The Rhythm In Anything), a masked transformer model for mapping rhythmic sound gestures to high-fidelity drum recordings. Given an audio prompt of the desired rhythmic pattern and a second prompt to represent drumkit timbre, TRIA produces audio of a drumkit playing the desired rhythm (with appropriate elaborations) in the desired timbre. Subjective and objective evaluations show that a TRIA model trained on less than 10 hours of publicly-available drum data can generate high-quality, faithful realizations of sound gestures across a wide range of timbres in a zero-shot manner.","short_abstract":"Musicians and nonmusicians alike use rhythmic sound gestures, such as tapping and beatboxing, to express drum patterns. While these gestures effectively communicate musical ideas, realizing these ideas as fully-produced drum recordings can be time-consuming, potentially disrupting many creative workflows. To bridge thi...","url_abs":"https://arxiv.org/abs/2509.15625","url_pdf":"https://arxiv.org/pdf/2509.15625v1","authors":"[\"Patrick O'Reilly\",\"Julia Barnett\",\"Hugo Flores García\",\"Annie Chu\",\"Nathan Pruyne\",\"Prem Seetharaman\",\"Bryan Pardo\"]","published":"2025-09-19T05:41:51Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}