{"ID":2865192,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22755","arxiv_id":"2509.22755","title":"Concept activation vectors: a unifying view and adversarial attacks","abstract":"Concept Activation Vectors (CAVs) are a tool from explainable AI, offering a promising approach for understanding how human-understandable concepts are encoded in a model's latent spaces. They are computed from hidden-layer activations of inputs belonging either to a concept class or to non-concept examples. Adopting a probabilistic perspective, the distribution of the (non-)concept inputs induces a distribution over the CAV, making it a random vector in the latent space. This enables us to derive mean and covariance for different types of CAVs, leading to a unified theoretical view. This probabilistic perspective also reveals a potential vulnerability: CAVs can strongly depend on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work. We illustrate this with a simple yet effective adversarial attack, underscoring the need for a more systematic study.","short_abstract":"Concept Activation Vectors (CAVs) are a tool from explainable AI, offering a promising approach for understanding how human-understandable concepts are encoded in a model's latent spaces. They are computed from hidden-layer activations of inputs belonging either to a concept class or to non-concept examples. Adopting a...","url_abs":"https://arxiv.org/abs/2509.22755","url_pdf":"https://arxiv.org/pdf/2509.22755v2","authors":"[\"Ekkehard Schnoor\",\"Malik Tiomoko\",\"Jawher Said\",\"Alex Jung\",\"Wojciech Samek\"]","published":"2025-09-26T09:22:31Z","proceeding":"stat.ML","tasks":"[\"stat.ML\",\"cs.LG\",\"math.PR\"]","methods":"[]","has_code":false}
