{"ID":2840270,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.14981","arxiv_id":"2511.14981","title":"Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation","abstract":"Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publicly share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.","short_abstract":"Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). \nUnlike previo...","url_abs":"https://arxiv.org/abs/2511.14981","url_pdf":"https://arxiv.org/pdf/2511.14981v1","authors":"[\"Nicholas Cooper\",\"Lijun Chen\",\"Sailesh Dwivedy\",\"Danna Gurari\"]","published":"2025-11-18T23:50:31Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false,"code_links":[{"ID":606955,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2840270,"paper_url":"https://arxiv.org/abs/2511.14981","paper_title":"Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation","repo_url":"https://github.com/Thegolfingocto/KD_wo_CE","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}