{"ID":2839634,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.15572","arxiv_id":"2511.15572","title":"From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers","abstract":"Feature-map knowledge distillation (KD) transfers internal representations well between comparably sized Vision Transformers (ViTs), but it often fails in compression. We revisit this failure and uncover a paradox. Sample-wise SVD shows that each image is highly compressible, which seems to suggest that a narrow student with a linear projector should match the teacher \"in principle\". However, a dataset-level view contradicts this intuition: PCA shows that the teacher is a union of low-rank subspaces with significant subspace rotation across inputs. We further introduce token-level Spectral Energy Patterns (SEP) and find an architecture-invariant encoding law: tokens spread energy broadly across channel modes even when they live in low-rank subspace, creating a bandwidth mismatch. We refer to this combined phenomenon as an encoding mismatch. We propose two minimal remedies, Lift or WideLast: (i) Lift retains a lightweight lifting projector at inference to provide wider channel, or (ii) WideLast widens only the student's last block, enabling an input-dependent expansion. On ImageNet-1K, these fixes revive feature KD for ViT compression, improving DeiT-Tiny distilled from CaiT-S24 from 74.86% to 77.53%/78.23% top-1 accuracy, and they also strengthen students trained without distillation. Our analyses clarify when and why feature-map KD fails and then how to fix it. Code and raw data are provided in https://github.com/thy960112/From-Per-Image-Low-Rank-to-Encoding-Mismatch.","short_abstract":"Feature-map knowledge distillation (KD) transfers internal representations well between comparably sized Vision Transformers (ViTs), but it often fails in compression. \nWe revisit this failure and uncover a paradox. Sample-wise SVD shows that each image is highly compressible, which seems to suggest that a narrow studen...","url_abs":"https://arxiv.org/abs/2511.15572","url_pdf":"https://arxiv.org/pdf/2511.15572v3","authors":"[\"Huiyuan Tian\",\"Bonan Xu\",\"Shijian Li\"]","published":"2025-11-19T16:03:21Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false,"code_links":[{"ID":606890,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2839634,"paper_url":"https://arxiv.org/abs/2511.15572","paper_title":"From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers","repo_url":"https://github.com/thy960112/From-Per-Image-Low-Rank-to-Encoding-Mismatch","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
