{"ID":2842934,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.11711","arxiv_id":"2511.11711","title":"Which Sparse Autoencoder Features Are Real? Model-X Knockoffs for False Discovery Rate Control","abstract":"Although sparse autoencoders (SAEs) are crucial for identifying interpretable features in neural networks, it is still challenging to distinguish between real computational patterns and erroneous correlations. We introduce Model-X knockoffs to SAE feature selection, using knock-off+ to control the false discovery rate (FDR) with finite-sample guarantees under the standard Model-X assumptions (in our case, via a Gaussian surrogate for the latent distribution). We select 129 features at a target FDR q=0.1 after analyzing 512 high-activity SAE latents for sentiment classification using Pythia-70M. About 25% of the latents under examination carry task-relevant signal, whereas 75% do not, according to the chosen set, which displays a 5.40x separation in knockoff statistics compared to non-selected features. Our method offers a re-producible and principled framework for reliable feature discovery by combining SAEs with multiple-testing-aware inference, advancing the foundations of mechanistic interpretability.","short_abstract":"Although sparse autoencoders (SAEs) are crucial for identifying interpretable features in neural networks, it is still challenging to distinguish between real computational patterns and erroneous correlations. We introduce Model-X knockoffs to SAE feature selection, using knock-off+ to control the false discovery rate...","url_abs":"https://arxiv.org/abs/2511.11711","url_pdf":"https://arxiv.org/pdf/2511.11711v1","authors":"[\"Tsogt-Ochir Enkhbayar\"]","published":"2025-11-12T17:12:45Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[]","has_code":false}