{"ID":2854714,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.14812","arxiv_id":"2510.14812","title":"Efficient Dynamic Structured Sparse Training with Learned Shuffles","abstract":"Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or N:M layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applied to three canonical structures -- block, N:M, and diagonals -- we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90--95\\% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\\times$ and infers up to $2.9\\times$ faster. The results position structure + learned permutation as a sweet spot between accuracy and efficiency.","short_abstract":"Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block...","url_abs":"https://arxiv.org/abs/2510.14812","url_pdf":"https://arxiv.org/pdf/2510.14812v1","authors":"[\"Abhishek Tyagi\",\"Arjun Iyer\",\"Liam Young\",\"William H Renninger\",\"Christopher Kanan\",\"Yuhao Zhu\"]","published":"2025-10-16T15:48:17Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[]","has_code":false}