{"ID":2833612,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.03994","arxiv_id":"2512.03994","title":"Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs","abstract":"As organizations increasingly deploy LLMs in sensitive domains such as legal, financial, and medical settings, ensuring alignment with internal organizational policies has become a priority. Existing content moderation frameworks remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and training cost. To address these limitations, we frame policy violation detection as an out-of-distribution (OOD) problem in the model's activation space. We propose a training-free method that operates directly on the LLM internal representations, leveraging prior evidence that decision-relevant information is encoded within them. Inspired by whitening techniques, we apply a linear transformation to decorrelate and standardize the model's hidden activations, and use the Euclidean norm in this transformed space as a compliance score for detecting policy violations. Our method requires only the policy text and a small number of illustrative samples, making it lightweight and easily deployable. We extensively evaluate our method across multiple LLMs and challenging policy benchmarks, achieving 86.0% F1 score while outperforming fine-tuned baselines by up to 9.1 points and LLM-as-a-judge by 16 points, with significantly lower computational cost. Code is available at: https://github.com/FujitsuResearch/LLM-policy-violation-detection","short_abstract":"As organizations increasingly deploy LLMs in sensitive domains such as legal, financial, and medical settings, ensuring alignment with internal organizational policies has become a priority. Existing content moderation frameworks remain largely confined to the safety domain and lack the robustness to capture nuanced or...","url_abs":"https://arxiv.org/abs/2512.03994","url_pdf":"https://arxiv.org/pdf/2512.03994v3","authors":"[\"Oren Rachmil\",\"Avishag Shapira\",\"Roy Betser\",\"Itay Gershon\",\"Omer Hofman\",\"Asaf Shabtai\",\"Yuval Elovici\",\"Roman Vainshtein\"]","published":"2025-12-03T17:23:39Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\",\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":606338,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2833612,"paper_url":"https://arxiv.org/abs/2512.03994","paper_title":"Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs","repo_url":"https://github.com/FujitsuResearch/LLM-policy-violation-detection","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
