{"ID":2897263,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.06427","arxiv_id":"2507.06427","title":"Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders","abstract":"Large Language Models (LLMs) are traditionally viewed as black-box algorithms, therefore reducing trustworthiness and obscuring potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of the prompts with additional annotations to improve the interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.","short_abstract":"Large Language Models (LLMs) are traditionally viewed as black-box algorithms, therefore reducing trustworthiness and obscuring potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders....","url_abs":"https://arxiv.org/abs/2507.06427","url_pdf":"https://arxiv.org/pdf/2507.06427v1","authors":"[\"Shun Wang\",\"Tyler Loakman\",\"Youbo Lei\",\"Yi Liu\",\"Bohao Yang\",\"Yuting Zhao\",\"Dong Yang\",\"Chenghua Lin\"]","published":"2025-07-08T22:17:52Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
