{"ID":2827004,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.17375","arxiv_id":"2512.17375","title":"AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens","abstract":"LLM-as-a-Judge systems supply the reward signal in modern RLHF and RLVR pipelines, but their binary verdict reduces to a single linear readout F_gap on one hidden state. We show this readout is shallow enough that short, low-perplexity tokens flip the verdict from \"No\" to \"Yes\". These tokens are sampled from the judge's own next-token distribution at the response position, with no manual seed set and no gradient-based optimization. Our procedure, AdvJudge-Zero, reaches $\u003e$90% ensemble false-positive rate on 22 of 24 (model, dataset) cells across six Qwen, Llama, and Gemma judges, versus 54-72% for the prior curated 10-token benchmark, and the discovered surface transfers cross-format to a 70B scalar reward model. The same discovered pool enables a defense: a LoRA fine-tune stratified by a 9-class mechanism taxonomy hardens cross-family generalization where naive sampling on the same pool fails, with mechanism breadth rather than pool size carrying the gain. Under GRPO training, the hardened judge eliminates the reward-collapse failures (false-positive spikes and length collapse) we observe in the unhardened baseline on both MATH and GSM8K at ten seeds per condition. The discovered pool, the mechanism taxonomy, and per-prompt flip records will be released under responsible disclosure.","short_abstract":"LLM-as-a-Judge systems supply the reward signal in modern RLHF and RLVR pipelines, but their binary verdict reduces to a single linear readout F_gap on one hidden state. We show this readout is shallow enough that short, low-perplexity tokens flip the verdict from \"No\" to \"Yes\". These tokens are sampled from the judge'...","url_abs":"https://arxiv.org/abs/2512.17375","url_pdf":"https://arxiv.org/pdf/2512.17375v2","authors":"[\"Tung-Ling Li\",\"Yuhao Wu\",\"Hongliang Liu\"]","published":"2025-12-19T09:22:11Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\",\"cs.CR\"]","methods":"[\"Large Language Model\",\"RLHF\",\"LoRA\"]","has_code":false}
