{"ID":2872153,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.09318","arxiv_id":"2509.09318","title":"Efficient Transformer-Based Piano Transcription With Sparse Attention Mechanisms","abstract":"This paper investigates automatic piano transcription based on computationally-efficient yet high-performant variants of the Transformer that can capture longer-term dependency over the whole musical piece. Recently, transformer-based sequence-to-sequence models have demonstrated excellent performance in piano transcription. These models, however, fail to deal with the whole piece at once due to the quadratic complexity of the self-attention mechanism, and music signals are thus typically processed in a sliding-window manner in practice. To overcome this limitation, we propose an efficient architecture with sparse attention mechanisms. Specifically, we introduce sliding-window self-attention mechanisms for both the encoder and decoder, and a hybrid global-local cross-attention mechanism that attends to various spans according to the MIDI token types. We also use a hierarchical pooling strategy between the encoder and decoder to further reduce computational load. Our experiments on the MAESTRO dataset showed that the proposed model achieved a significant reduction in computational cost and memory usage, accelerating inference speed, while maintaining transcription performance comparable to the full-attention baseline. This allows for training with longer audio contexts on the same hardware, demonstrating the viability of sparse attention for building efficient and high-performance piano transcription systems. The code is available at https://github.com/WX-Wei/efficient-seq2seq-piano-trans.","short_abstract":"This paper investigates automatic piano transcription based on computationally-efficient yet high-performant variants of the Transformer that can capture longer-term dependency over the whole musical piece. Recently, transformer-based sequence-to-sequence models have demonstrated excellent performance in piano transcri...","url_abs":"https://arxiv.org/abs/2509.09318","url_pdf":"https://arxiv.org/pdf/2509.09318v1","authors":"[\"Weixing Wei\",\"Kazuyoshi Yoshii\"]","published":"2025-09-11T10:02:11Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.MM\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":609946,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2872153,"paper_url":"https://arxiv.org/abs/2509.09318","paper_title":"Efficient Transformer-Based Piano Transcription With Sparse Attention Mechanisms","repo_url":"https://github.com/WX-Wei/efficient-seq2seq-piano-trans","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
